This application is a divisional application of application Ser. No. 08/903,217 filed July
20, 1997, to be issued on May 29, 2001 as U.S. Patent No. 6,240,374, which was a 
continuation-in-part of application Ser. No. 08/657,147 filed June 3, 1996 which was a 
continuation-in-part of application Ser. No. 08/592,132 filed January 26, 1996 which issued 
February 6, 2001 as U.S. Patent No. 6,185,506 entitled A Method For Selecting An Optimally
Diverse Library Of Small Molecules Based On Validated Molecular Structural Descriptors. 

A portion of the disclosure of this patent document contains material which is subject 
to copyright protection. The copyright owner has no objection to the facsimile reproduction 
by anyone of the patent document or the patent disclosure, as it appears in the U.S. Patent and 
Trademark Office but otherwise reserves all copyright rights whatsoever.

Technical Field

This invention relates to the field of molecular structure/activity analysis and more 
specifically to: 1) a method of validating molecular structural descriptors; 2) a method using 
validated molecular descriptors to design an optimally diverse combinatorial screening library;
3) a method of merging libraries derived from different combinatorial chemistries; 4) a method
using validated molecular descriptors of generating a searchable virtual library of molecules 
which can be combinatorially derived; 5) methods of searching the virtual library for 
combinatorially derived product molecules which meet specified criteria; and 6) methods of 
following up and optimizing identified leads. The screening libraries designed by the methods
of this invention are constructed to ensure that an optimal structural diversity of compounds
is represented. The search methods of the invention ensure that the same diversity space is not 
oversampled and that compounds can be identified having a high likelihood of possessing the 
same structure and/or activity of a lead compound. In particular, the invention describes the 
design of libraries of small molecules to be used for pharmacological testing. 

Background Art

Statement Of The Problem 

While the present invention is discussed with detailed reference to the search for and 
identification of pharmacologically useful chemical compounds, the invention is applicable to 
any attempt to search for and identify chemical compounds which have some desired physical 
or chemical characteristic(s). The broader teachings of this invention are easily recognized if
a different functional utility or useful property describing other chemical systems is substituted 
below for the term "biological activity". 




Starting with the serendipitous discovery of penicillin by Fleming and the subsequent 
directed searches for additional antibiotics by Waksman and Dubos, the field of drug discovery 
during the post World War II era has been driven by the belief that nature would provide many 
needed drugs if only a careful and diligent search for them were conducted. Consequently,
pharmaceutical companies undertook massive screening programs which tested samples of
natural products (typically isolated from soil or plants) for their biological properties. In a 
parallel effort to increase the effectiveness of the discovered "lead" compounds, medicinal 
chemists learned to synthesize derivatives and analogs of the compounds. Over the years, as 
biochemists identified new enzymes and biological reactions, large-scale screening continued
as compounds were tested for biological activity in an ever more rapidly expanding number of
biochemical pathways. However, proportionately fewer and fewer lead compounds possessing 
a desired therapeutic activity have been discovered. In an attempt to extend the range of 
compounds available for testing, during the last few years the search for unique biological 
materials has been extended to all corners of the earth including sources from both the tropical
rain forests and the ocean. Despite these and other efforts, it is estimated that discovery and
development of each new drug still takes about 12 years and costs on the order of 350 million 
dollars. 

Beginning approximately twenty-five years ago, as bioscientists learned more about the 
chemical and stereochemical requirements for biological interactions, a variety of semi-empirical,
theoretical, and quantitative approaches to drug design were developed. These
approaches were accelerated by the availability of powerful computers to perform 
computational chemistry. It was hoped that the era of "rational drug design" would shorten the 
time between significant discoveries and also provide an approach to discovering compounds 
active in biological pathways for which no drugs had yet been discovered. In large part, this
work was based on the accumulated observation of medicinal chemists that compounds which
were structurally similar also possessed similar biological activities. While significant strides 
were made using this approach, it too, like the mass screening programs, failed to provide a 
solution to the problem of rapidly discovering new compounds with activities in the ever 
increasing number of biological pathways being elucidated by modern biotechnology.

During the past four or five years, a revised screening approach has been under
development which, it was hoped, would accelerate the pace of drug discovery. In fact, the 
approach has been remarkably successful and represents one of the most active areas in 
biotechnology today. This new approach utilizes combinatorial libraries against which 
biological assays are screened. Combinatorial libraries are collections of molecules generated
by synthetic pathways in which either: 1) two groups of reactants are combined to form 
products; or 2) one or more positions on core molecules are substituted by a different chemical 
constituent/moiety selected from a large number of possible constituents.

Two fundamental ideas underlie combinatorial screening libraries. The first idea,
common to all drug research, is that somewhere amongst the diversity of all possible chemical 
structures there exist molecules which have the appropriate shape and binding properties to 
interact with any biological system. The second idea is the belief that synthesizing and testing 
many molecules in parallel is a more efficient way (in terms of time and cost) to find a 
molecule possessing a desired activity than the random testing of compounds, no matter what
their source. In the broadest context, these ideas require that, since the binding requirements 
of a ligand to the biological systems under study (enzymes, membranes, receptors, antibodies, 
whole cell preparations, genetic materials, etc.) are not known, the screened compounds should 
possess as broad a range of characteristics (chemical and physical) as possible in order to 
increase the likelihood of finding one that is appropriate for any given biological target. This
requirement for a screening library is reflected in the term "diversity" - essentially a way of 
suggesting that the library should contain as great a dissimilarity of compounds as possible. 

However, as is immediately apparent, a combinatorial approach to synthesizing 
molecules generates an immense number of compounds many with a high degree of structural 
similarity. In fact, the number of compounds synthetically accessible with known organic
reactions exceeds by many orders of magnitude the numbers which can actually be made and 
tested. One area where these ideas were first explored is in the design of peptide libraries. For
a library of five-member peptides synthesized using the 20 naturally occurring amino acids,
3,200,000 (20^5) different peptides may be constructed. The number of combinatorial
possibilities increases even more dramatically when non-peptide combinatorial libraries are
considered. With non-peptide libraries, the whole synthetic chemical universe of combinatorial 
possibilities is available. Library sizes ranging from 5 X 10^ to 4 X 10'^ molecules are now 
being discussed. The enormous universe of chemical compounds is both a blessing and a curse 
to medicinal chemists seeking new drugs. On the one hand, if a molecule exists with the 
desired biological activity, it should be included in the chemical universe. On the other hand,
it may be impossible to find. Thus, the principal focus of recent efforts has been to define 
smaller screening subsets of molecules derivable from accessible combinatorial syntheses 
without losing the inherent diversity of an accessible universe. 
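
The peptide count quoted above follows from multiplying the number of choices available at each position. A minimal sketch of that arithmetic is given here; the function name `library_size` is hypothetical and is used only for illustration, not taken from the specification.

```python
from math import prod


def library_size(choices_per_position):
    """Count the distinct products when every position is filled independently.

    `choices_per_position` lists how many building blocks may occupy each
    position (hypothetical helper, for illustration only).
    """
    return prod(choices_per_position)


# Five-member peptides built from the 20 naturally occurring amino acids.
peptide_5mers = library_size([20] * 5)
print(f"{peptide_5mers:,}")  # 3,200,000
```

The same multiplication applies to a core molecule with several substitution positions, which is why the accessible universe grows so quickly as positions or building blocks are added.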




To date, in order to narrow the focus of the search and reduce the number of 
compounds to be screened, attention has been directed to designing biologically specific 
libraries. Thus, many combinatorial screening libraries existing in the prior art have been 
designed based on prior knowledge about a particular biological system such as a known 
pharmacophore (a geometric arrangement of structural fragments abstracted from molecular
structures known to have activity). Even with this knowledge, molecules are included in these 
prior art libraries based on intuition - "seat of the pants" estimations of likely similarity based 
on an intuitive "feel" for the systems under study. This procedure is essentially pseudo-random 
screening, not rational library design. Several biotechnology startup companies have developed
just such proprietary libraries, and success using combinatorial libraries has been achieved by
sheer effort. In one example 18 libraries containing 43 million compounds were screened to 
identify 27 active compounds' . With library searches of this magnitude, it is most likely that 
the enormous number of inactive molecules [(43 X 10^6) - 27] must have included staggering
numbers of redundantly inactive molecules - molecules not significantly distinguishable from
one another - even in libraries designed with a particular biological target in mind. Clearly,
when searching for a lead molecule which interacts with an uncharacterized biological target, 
approaches requiring knowledge of the biological targets will not work. But finding such a lead
is exactly the case for which it is hoped general purpose screening libraries can be designed. 
If the promise of combinatorial chemistry is ever to be fully realized, some rational and
quantitative method of reducing the astronomical number of compounds accessible in the
combinatorial chemistry universe to a number which can be usefully tested is required. In other 
words, the efficiency of the search process must be increased. For this purpose, a smaller 
rationally designed screening library, which still retains the diversity of the combinatorially 
accessible compounds, is absolutely necessary. 

Thus, there are two criteria which must be met by any screening library subset of some
universe of combinatorially accessible compounds. First, the diversity, the dissimilarity of the 
universe of compounds accessible by some combinatorial reaction, must be retained in the 
screening subset. A subset which does not contain examples of the total range of diversity in 
such a universe would potentially miss critical molecules, thereby frustrating the very reason
for the creation of the subset. Second, for efficient screening, the ideal subset should not
contain more than one compound representative of each aspect of the diversity of the larger 
group. If more than one example were included, the same diversity would be tested more than 
once. Such redundant screening would yield no new information while simultaneously 
increasing the number of compounds which must be synthesized and screened. Therefore, the
fundamental problem is how to reduce to a manageable number the number of compounds that 
need to be synthesized and tested while at the same time providing a reasonably high 
probability that no possible molecule of biological importance is overlooked. (In this regard, 
it should be recognized that the only way of absolutely ensuring that all diversity is represented
in a library is to include and test all compounds.) A conceptual analogy to the problem might 
be: what kind of filter can be constructed to sort out from the middle of a blinding snowstorm 
individual snowflakes which represent all the classes of crystal structures which snowflakes can 
form? 

The fundamental question plaguing progress in this area has been whether the concept
of the diversity of molecular structure can be usefully described and quantified; that is, how 
is it possible to compare/distinguish the physical and chemical properties determinative of 
biological activity of one molecule with that of another molecule? Without some way to 
quantitatively describe diversity, no meaningful filter can be constructed. Fortunately, for
biological systems, the accumulated wisdom of bioscientists has recognized a general principle
alluded to earlier which provides a handle on this problem. As framed by Johnson and 
Maggiora^ the principle is simply stated as: "structurally similar molecules are expected to 
exhibit similar (biological) properties." Based on this principle, quantifying diversity becomes 
a matter of quantifying the notion of structural similarity. Thus, for design of a screening
subset of a combinatorial library (hereafter referred to as a "combinatorial screening library"),
it should only be necessary to identify which molecules are structurally similar and which 
structurally dissimilar. According to the selection criteria outlined above, one molecule of each 
structurally similar group in the combinatorially accessible chemical universe would be 
included in the library subset. Such a library would be an optimally diverse combinatorial
screening library. The problem for medicinal chemists is to determine how the intuitively
perceived notions of structural similarity of chemical compounds can be validly quantified. 
Once this question is satisfactorily answered, it should be possible to rationally design 
combinatorial screening libraries.

Prior Art Approaches

Many descriptors of molecular structure have been created in the prior art in an attempt
to quantify structural similarity and/or dissimilarity. As the art has recognized, however, no 
method currently exists to distinguish those descriptors that quantify useful aspects of similarity 
from those which do not. The importance of being able to validate molecular descriptors has 
been a vexing problem restricting advances in the art, and, before this invention, no generally
applicable and satisfactory answer had been found. The problem may be conceptualized in 
terms of a multidimensional space of structurally derivable properties which is populated by 
all possible combinatorially accessible chemical compounds. Compounds lying "near" one 
another in any one dimension may lie "far apart" from one another in another dimension. The
difficulty is to find a useful design space - a quantifiable dimensional space (metric space) in
which compounds with similar biological properties cluster; i.e., are found measurably near to
each other. What is desired is a molecular structural descriptor which, when applied to the 
molecules of the chemical universe, defines a dimensional space in which the "nearness" of
the molecules with respect to a specified characteristic (i.e., biological activity) in the chemical
universe is preserved in the dimensional space. A molecular structural descriptor (metric)
which does not have this property is useless as a descriptor of molecular diversity. A valid
descriptor is defined as one which has this property. 

In light of the above, it should be noted that there is a difference between a descriptor
being valid and being perfect. There may or may not be a "perfect" metric which precisely and
quantitatively maps the diversity of compounds (much less those of biological interest). 
However, a good approximation is sufficient for purposes of designing a combinatorial 
screening library and is considered valid/useful. Acceptance of this validation/usefulness
criterion is essentially equivalent to saying that, if there is a high probability that, when one
molecule is active (or inactive), a second molecule is also active (or inactive), then most of
the time sampling one of the pair will be sufficient. Restating this same principle with a 
slightly different emphasis highlights another feature, namely: the design criteria for 
combinatorial screening libraries should yield a high probability that, for any given inactive 
molecule, it is more probable to find an active molecule somewhere else rather than as a near
neighbor of that inactive molecule. While this is a probabilistic approach, it emphasizes that
a good approximation to a perfect metric is sufficient for purposes of designing a combinatorial 
screening library as well as in other situations where the ability to discriminate molecular 
structural difference and similarities is required. A perfect descriptor (certainty) for 
pharmacological searching is not needed to achieve the required level of confidence as long
as it is valid (maps a subspace where biological properties cluster).
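
The probabilistic reading of validity given above can be illustrated with a small calculation: among pairs of compounds that lie close together in a descriptor space, what fraction share the same activity class? The sketch below is only an illustration of that stated criterion and is not the validation method of the invention; the function name `neighbor_agreement`, the one-dimensional descriptor values, the activity labels, and the radius are all hypothetical.

```python
from itertools import combinations
from math import dist


def neighbor_agreement(descriptors, active, radius):
    """Fraction of near-neighbor pairs whose members share an activity class.

    descriptors: name -> coordinate tuple in the descriptor space
    active:      name -> True/False activity label
    radius:      assumed cutoff below which two compounds count as neighbors
    Returns None when no pair lies within the radius.
    """
    near = [
        (a, b)
        for a, b in combinations(sorted(descriptors), 2)
        if dist(descriptors[a], descriptors[b]) <= radius
    ]
    if not near:
        return None
    same = sum(active[a] == active[b] for a, b in near)
    return same / len(near)


# Hypothetical one-dimensional descriptor values and activity labels.
descriptors = {"m1": (0.0,), "m2": (0.2,), "m3": (3.0,), "m4": (3.1,), "m5": (3.3,)}
active = {"m1": True, "m2": True, "m3": False, "m4": False, "m5": True}

agreement = neighbor_agreement(descriptors, active, radius=0.5)
print(agreement)  # 0.5
```

A value near 1 would suggest that neighbors in this descriptor tend to share activity, while a value close to what unrelated pairs would give suggests that the descriptor carries little information about activity for this data set.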

The typical prior art approach for establishing selection criteria for screening library 
subsets relied on the following clustering paradigm: 1) characterization of compounds 
according to a chosen descriptor(s) (metric[s]); 2) calculation of similarities or "distances" in
the descriptor (metric) between all pairs of compounds; and 3) grouping or clustering of the 
compounds based on the descriptor distances. The idea behind the paradigm is that, within a 
cluster, compounds should have similar activities and, therefore, only one or a few compounds 
from each cluster, which will be representative of that cluster, need be included in a library. 
The actual clustering is done until the prior art user feels comfortable with the groupings and 
their spacing. However, with no knowledge of the validity/usefulness of the descriptor 
employed, and no guidance with respect to the size or spacing of clusters to be expected from 
any given descriptor, prior art clustering has been, at best, another intuitive "seat of the pants" 
approach to diversity measurement. 
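
The three-step clustering paradigm described above can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional descriptor values are invented, Euclidean distance stands in for whatever similarity measure a given descriptor uses, and a simple "leader" clustering with an arbitrary `radius` is swapped in because the text does not name a particular clustering algorithm. The arbitrary radius is exactly the kind of unguided choice the paragraph criticizes.

```python
from itertools import combinations
from math import dist


def pairwise_distances(descriptors):
    """Step 2: distance between every pair of compounds in descriptor space."""
    return {
        (a, b): dist(descriptors[a], descriptors[b])
        for a, b in combinations(sorted(descriptors), 2)
    }


def leader_clusters(names, distances, radius):
    """Step 3: put each compound in the first cluster whose leader is within radius."""
    clusters = {}  # leader name -> list of member names
    for name in sorted(names):
        for leader in clusters:
            pair = tuple(sorted((name, leader)))
            if distances[pair] <= radius:
                clusters[leader].append(name)
                break
        else:
            clusters[name] = [name]  # no leader close enough: start a new cluster
    return clusters


# Step 1: characterize compounds with a chosen descriptor (hypothetical values).
descriptors = {
    "cpd_a": (0.0, 0.0),
    "cpd_b": (0.1, 0.0),
    "cpd_c": (5.0, 5.0),
    "cpd_d": (5.0, 5.2),
    "cpd_e": (10.0, 0.0),
}

distances = pairwise_distances(descriptors)
clusters = leader_clusters(descriptors, distances, radius=1.0)
subset = sorted(clusters)  # one representative (the leader) per cluster
print(clusters)
print(subset)
```

Changing `radius` changes both the number of clusters and the size of the selected subset, and nothing in the procedure itself indicates which value is appropriate.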

The prior art describes the construction and application of many molecular structural 
descriptors while all the while tacitly acknowledging that little progress has been made towards 
solving the fundamental problem of establishing their validity. The field has nevertheless 
proceeded based on the belief/faith that, by incorporating in the descriptors certain measures
which had been recognized in QSAR studies as being important contributors to defining 
structure-activity relationships, valid/useful descriptors would be produced. In a leading 
method representative of this prior art approach to defining a similarity descriptor, E. Martin 
et al.^ construct a metric for quantifying structural similarity using measures that characterize 
lipophilicity, shape and branching, chemical functionality, and receptor recognition features. 
(For the reasons set forth later in relation to the present invention, Martin et al. applied their 
metric to the reactants which would be used in combinatorial synthesis.) This large set of 
measures is used to generate a statistically blended metric consisting of a total of 16 properties 
for each individual reactant studied (5 shape descriptors, 5 measures of chemical functionality, 
5 receptor binding descriptors, and one lipophilicity property). This generates a 16 dimensional 
property space. The 16 properties are simultaneously displayed in a circular "Flower Plots" 
graphical environment, where each property is assigned a petal. All the plots together visually 
display how the diversity of the studied reactants is distributed through the computed property 
space. Martin acknowledges that the plots "...cannot, of course, prove that the subset is 
diverse in any 'absolute' sense, independent of the calculated properties." (Martin at 1434) 

In another approach relating to peptoid design, Martin et al."* have characterized the 
varieties of shape that an unknown receptor cavity might assume by a few assemblages of 
blocks, called "polyominos". Candidates for a combinatorial design are classified by the types 
of polyominos into which they can be made to fit, or "docked". The 7 flexible polyomino 
shape descriptors are added to the previously defined 16 descriptors to yield a 23 dimensional 
property space. Martin has demonstrated that the docking procedure generates for a
methotrexate ligand in a cavity of dihydrofolate reductase nearly the same structure as that
established by X-ray diffraction studies. The docking procedure, which must be applied to 
every design candidate for each polyomino, requires a considerable amount of CPU time (is
computationally expensive). However, a problem with this approach is the conceptually severe
(unjustified) approximation of representing all possible irregularly shaped receptor cavities by 
only about a dozen assemblies of smooth-sided polyomino cubes. Martin has also presented 
no validation of the approach, which in this case, would be a demonstration that molecules 
which fit into the same polyominos tend to have similar biological properties. 

One approach which has been taken to try to empirically assess the relative validity of
prior art metrics has been to survey the metrics to see if any of them appeared to be superior 
to any others as judged by clustering analysis. Y. C. Martin et al.^ have reported that 3D 
fingerprints, collections of fragments defined by pairs of atoms and their accessible interatomic 
distances, perform no better than collections of 2D fragments in defining clusters that separate
biologically active from inactive compounds. As will be seen later, some of this work pointed
towards the possible validity of one metric, but the authors concentrated on the comparative 
clustering aspects and did not follow up on the broader import of the data. 

W. Herndon^ among others has pointed out that an experimentally determined similarity
QSAR is, by definition, a good test of the validity of that similarity concept for the biological
system from which it is derived and may have some usefulness in estimating diversity for that
system. However, QSARs essentially map only the space of a particular receptor, do not 
provide information about the validity of other descriptors, and would be generally inapplicable 
to construction of a combinatorial screening library designed for screening unknown receptors 
or those for which no QSAR data was available. 

Finally, D. Chapman et al.^ have used their "Compass" 3D-QSAR descriptor which is
based on the three dimensional shape of molecules, the locations of polar functionalities on the 
molecules, and the fixation entropies of the molecules to estimate the similarity of molecules. 
Essentially, using the descriptor, they try to find the molecules which have the maximum 
overlap (in geometric/cartesian space) with each other. The shape of each molecule of a series
is allowed to translate and rotate relative to each other molecule and the internal degrees of
freedom are also allowed to rotate in an iterative procedure until the shapes with greatest or 
least overlap similarity are identified. Selecting 20 maximally diverse carboxylic acids based 
on seeking the maximally diverse alignment of each of the 3000 acids considered took
approximately 4 CPU computing weeks by their method. No indication was given of whether 
their descriptor was valid in the sense defined above, and, clearly, such a procedure would be 
too time consuming to apply to a truly large combinatorial library design. 

One way in which many of the prior art approaches attempt to work around the
problem of not knowing if a molecular structural descriptor is valid is to try, when clustering,
to maximize as much as possible the distance between the clusters from which compounds will 
be selected for inclusion in the screening library subset. The thinking behind this approach is 
that, if the clusters are far enough apart, only molecules diverse from each other will be 
chosen. Conversely, it is thought that, if the clusters are close together, oversampling
(selection of two or more molecules representative of the same elements of diversity) would
likely occur. However, as we have seen, if the metric used in the cluster analysis is not 
initially valid (does not define a subspace in which molecules with similar biological activity 
cluster), then no amount of manipulation will prevent the sample from being essentially 
random. Worse yet, an invalid metric might not yield a selection as good as random! The
acknowledgement by Martin quoted above is a recognition of the prior art's failure to yet
discover a general method for validating descriptors. 

Another related problem in the prior art is the failure to have any objective manner of 
ascertaining when the library subset under design has an adequate number of members; that 
is, when to stop sampling. Clearly, if nothing is known about the distribution of the diversity
of molecules, one arbitrary stopping point is as good as any other. Any stopping point may
or may not sample sufficiently or may oversample. In fact, the prior art has not recognized 
a coherent quantitative methodology for determining the end point of selection. Essentially, 
in the prior art, a metric is used to maximize the presumed differences between molecules 
(typically in a clustering analysis), and a very large number of molecules are chosen for
inclusion in a screening library subset based on the belief that there is safety in numbers; that
sampling more molecules will result in sampling more of the diversity of a combinatorially 
accessible chemical space. As pointed out earlier, however, only by including all possible 
molecules in a library will one guarantee that all of the diversity has been sampled. Short of 
such total sampling, users of prior art library subsets constructed along the lines noted above
do not know whether a random sample, a representative sample, or a highly skewed sample
has been screened. 

Several other problems flow from the inability to rationally select a combinatorial 
screening library for optimal diversity and these are related both to the chemistry used to 
create the combinatorial library and the screening systems used. First, because many more
molecules may have to be synthesized than may be needed, mass synthetic schemes have to 
be devised which create many combinations simultaneously. In fact, there is a good deal of 
disagreement in the prior art as to whether compounds should be synthesized individually or
collectively or in solution or on solid supports. Within any synthetic scheme, an additional
problem is keeping track of and identifying the combinations created. It should be understood 
that, where relatively small (molecular weight of less than about 1500) organic molecules are 
concerned, generally standard, well known, organic reactions are used to create the molecules. 
In the case of peptide-like molecules, standard methods of peptide synthesis are employed.
Similarly for polysaccharides and other polymers, reaction schemes exist in the prior art which
are well known and can be utilized. While the synthesis of any individual combinatorial 
molecule may be straightforward, much time and effort has been and is still being expended 
to develop synthetic schemes in which hundreds, thousands, or tens of thousands of 
combinatorial combinations can be synthesized simultaneously. 

In many synthetic schemes, mixtures of combinatorial products are synthesized for
screening in which the identity of each individual component is uncertain. Alternatively, many
different combinatorial products may be mixed together for simultaneous screening. Each 
additional molecule added to a simultaneous screen means that many fewer individual screening 
operations have to be performed. Thus, it is not unusual that a single assay may be
simultaneously tested against up to 625 or more different molecules. Not until the mixture
shows some activity in the biological screening assay will an attempt be made to identify the 
components. Many approaches in the prior art therefore face "deconvolution" problems; i.e.,
trying to figure out what was in an active mixture either by following the synthetic reaction 
pathway, by resynthesizing the individual molecules which should have resulted from the
reaction pathway, or by direct analysis of duplicate samples. Some approaches even tag the
carrier of each different molecule with a unique molecular identifier which can be read when 
necessary. All these problems are significantly decreased by designing a library for optimal 
diversity. 

Another major problem with the inclusion of multiple and potentially non-diverse 
compounds in the same screening mixture is that many assays will yield false positives (have
an activity detected above a certain established threshold) due to the combined effect of all the 
molecules in the screening mixture. The absence of the desired activity is only determined after 
expending the time, effort, and expense of identifying the molecules present in the mixture and 




11 



testing them individually. Such instances of combined reactivity are reduced when the 
screening mixture can be selected from molecules belonging to diverse groups of an optimally 
designed library, since molecules of different (diverse) structures are less likely to
produce a combined effect.

It is clear that a great deal of cleverness has been expended in actually manufacturing
the combinatorial libraries. While the basic chemistry of synthesizing any given molecule is 
straightforward, the next advance in the development of combinatorial chemistry screening
libraries will be optimization of the design of the libraries. 

Further problems in the prior art arise in the attempt to follow up leads resulting from
the screening process. As noted above, many libraries are designed with some knowledge of
the receptor and its binding requirements. While, within those constraints, all possible 
combinatorial molecules are synthesized for screening, finding a few molecules with the 
desired activity among such a library yields no information about what active molecules might 
exist in the universe accessible with the same combinatorial chemistry but outside the limited
(receptor) library definition. This is an especially troubling problem since, from serendipitous
experience, it is well known that sometimes totally unexpected molecules with little or no 
obvious similarity to known active molecules exhibit significant activity in some biological 
systems. Thus, even finding a candidate lead in a library whose design was based on 
knowledge of the receptor is no guarantee that the lead can be followed to an optimal
compound. Only a rationally designed combinatorial screening library of optimal diversity can
approach this goal. 

For prior art library subsets designed around the use of some descriptor to cluster 
compounds, similar problems may exist. In such a library design, one or at most a few 
compounds will have been selected from each cluster. Only if the descriptor is valid does such
a selection procedure make sense. If the descriptor is not valid, each cluster will contain
molecules representative of many different diversities and selecting from each cluster will still 
have resulted in a random set of molecules which do not sample all of the diversity present. 
Since the prior art does not possess a generally applicable method of validating descriptors, 
all screening performed with prior art libraries is suspect and may not have yielded all the 
useful information desired about the larger chemical universe from which the library subsets 
were selected. 

Finally, as the expense in time and effort of creating and screening combinatorial 
libraries increases, the question of the uniqueness of the libraries becomes ever more critical. 
Questions can be asked such as: 1) does library "one" cover the same diversity of chemical 
structures as library "two"; 2) if libraries "one" and "two" cover both different and identical 
aspects of diversity, how much overlap is there; 3) what about the possible overlap with 
libraries "three", "four", "five", etc.? To date, the prior art has been unable to answer these 
questions. In fact, assumptions have been made that as long as different chemistries were 
involved (i.e., proteins, polysaccharides, small organic molecules), it was unlikely that the 
same diversity space was being sampled. However, such an assumption contradicts the well 
known reality that biological receptors can recognize molecular similarities arising from 
different structures. When screening for compounds possessing activity for undefined biological 
receptors, there is no way of telling a priori which chemistry or chemistries is most likely to 
produce molecules with activity for that receptor. Thus, screening with as many chemistries 
as possible is desired but is only really practical if redundant sampling of the same diversity 
space in each chemistry can be avoided. The prior art has not provided any guidance towards 
the resolution of these problems. 

Brief Summary Of The Invention 
In order to select a screening subset of a combinatorially accessible chemical universe 
which is representative of all the structural variation (diversity) to be found in the universe, 
it is necessary to have the means to describe and compare the molecular structural diversity 
in the universe. The first aspect of the present invention is the discovery of a generalized 
method of validating descriptors of molecular structural diversity. The method does not assume 
any prior knowledge of either the nature of the descriptor or of the biological system being 
studied and is generally applicable to all types of descriptors of molecular structure. This 
discovery enables several related advances to the art. 

The second aspect of the invention is the discovery of a method of generating a 
validated three dimensional molecular structural descriptor using CoMFA fields. To generate 
these field descriptors required solving the alignment problem associated with these 
measurements. The alignment problem was solved using a topomeric procedure. 

A third aspect of the invention is the discovery that validated molecular structural 
descriptors applicable to whole molecules can be used both to: 1) quantitatively define a 
meaningful end-point for selection in defining a single screening library (sampling procedure); 
and 2) merge libraries so as not to include molecules of the same or similar diversity. It is 
shown that a known metric (Tanimoto 2D fingerprint similarity) can be used in conjunction 
with the sampling procedure for this purpose. 

A fourth aspect of the invention is the discovery of a method of using validated reactant 
and whole molecule molecular structural descriptors to rationally design a combinatorial 
screening library of optimal diversity. In particular, the shape sensitive topomeric CoMFA 
descriptor and the atom group Tanimoto 2D similarity descriptor may be used in the library 
design. As a benefit of designing a combinatorial screening library of optimal diversity based 
on validated molecular descriptors, many prior art problems associated with the synthesis, 
identification, and screening of mixtures of combinatorial molecules can be reduced or 
eliminated. 

A fifth aspect of the invention is the use of validated molecular structural descriptors 
to guide the search for optimally active compounds after a lead compound has been identified 
by screening. In the case of a screening library designed for optimal diversity using validated 
descriptors, a great deal of the information necessary for lead optimization flows directly from 
the library design. In the case where a lead has been identified by screening a prior art library 
or through some other means, validated descriptors provide a method for identifying the 
molecular structural space nearest the lead which is most likely to contain compounds with the 
same or similar activity. 

A sixth aspect of this invention is the discovery of a method for generating, using 
validated molecular descriptors, a virtual library of product molecules derivable from 
combinatorial reactions (or which may be represented by a combinatorial SLN [CSLN]) in 
which the characteristics of product molecules can be searched and compared without the 
actual construction of the product molecules. This virtual library allows the searching of 
billions of possible product molecules in reasonable amounts of time. 

A seventh aspect of this invention is the discovery that, using validated molecular 
descriptors, the virtual library can be searched over billions of possible product molecules in 
ways to yield both optimally diverse screening libraries and to follow up on lead explosions. 
Using the virtual library, a much larger fraction of the chemically accessible universe can be 
searched for molecules of interest. 

An eighth aspect of this invention is the discovery of a way to search, using validated 
molecular descriptors, the virtual library for possible molecules which have similar structures 
and/or activities to a query molecule which is not necessarily derived from a combinatorial 
synthesis. This discovery opens up a whole new method for seeking molecules with similar 
characteristics to a previously identified molecule. 

It is an object of this invention to define a general process which may be used with 
randomly selected literature data sets to validate molecular structural descriptors. 

It is a further object of this invention to define a process to derive CoMFA steric fields 
(and, if desired, additional relevant fields) using topomeric alignment so that the resulting 
descriptor is valid. 

It is a further object of this invention to teach that topomeric alignments may be used 
to describe molecular conformations. 

It is a further object of this invention to define a general process for using a validated 
molecular descriptor to establish a meaningful end-point for the sampling of compounds 
thereby avoiding the oversampling of compounds representing the same molecular structural 
characteristics. 

It is yet a further object of this invention to design an optimally diverse combinatorial 
screening library using multiple validated molecular structural descriptors. 

It is a further object of this invention to use the topomeric CoMFA molecular structural 
descriptor as a reactant descriptor in the design of an optimally diverse combinatorial screening 
library. 

It is a further object of this invention to use the Tanimoto 2D similarity molecular 
structural descriptor as a product descriptor in the design of an optimally diverse combinatorial 
screening library. 

It is a further object of this invention to define a method for merging assemblies of 
molecules (libraries), both those designed by the methods of this invention and others not 
designed by the methods of this invention, in such a manner that molecules representing the 
same or similar diversity space are not likely to be included. 

It is a further object of this invention to define methods for the use of validated 
molecular structural descriptors to guide the search for optimally active compounds after a lead 
compound has been identified by screening or some other method. 

It is a further object of this invention to generate a virtual library, using validated 
molecular descriptors, of potential product molecules derivable from combinatorial reactions 
(or which may be represented by a combinatorial SLN [CSLN]) which can be searched for 
molecules having desired characteristics. 

It is a further object of this invention to define methods for creating optimal diversity 
screening libraries as subsets of the virtual library. 

It is still a further object of this invention to locate within the virtual library possible 
product molecules similar in structure and/or activity to lead compounds. 

These and further objects of the invention will become apparent from the detailed 
description of the invention which follows. 

Brief Description Of Drawings 

Figure 1 schematically shows the distribution of molecular structures around and about 
an island of biological activity in a hypothetical two dimensional metric space for a poorly 
designed prior art library and for an efficiently designed optimally diverse screening library. 

Figure 2 shows a theoretical scatter plot (Patterson Plot) for a metric having the 
neighborhood property in which the X axis shows distances in some metric space calculated 
as the absolute value of the pairwise differences in some candidate molecular descriptor and 
the Y axis shows the absolute value of the pairwise differences in biological activity. 

Figure 3 shows a Patterson plot for an illustrative data set. 

Figure 4 shows a Patterson plot for the same data set as in Figure 3 but where the 
diversity descriptor values (X axis) associated with each molecule have been replaced by 
random numbers. 

Figure 5 shows a Patterson plot for the same data set as in Figure 3 but where the 
diversity descriptor values (X axis) associated with each molecule have been replaced by a 
normalized force field strain energy/atom value. 

Figure 6 shows three molecular structures numbered and marked in accordance with 
the topomeric alignment rule. 

Figure 7 is a complete set of Patterson plots for the twenty data sets used for the 
validation studies of the topomeric CoMFA descriptor. 

Figure 8 shows the two scatter plots displaying the relation between χ² values and their 
corresponding density ratio values for the tested metrics over the twenty random data sets. 

Figure 9 shows the graphs of the Tanimoto similarity measure vs. the pairwise 
frequency of active molecules for 18 groups examined from Index Chemicus. 

Figure 10 shows a Patterson plot of the Cristalli data set using only those values which 
would have been used for a Tanimoto sigmoid plot of the same data set alongside a Patterson 
plot of the complete data set. 

Figure 11 is a schematic of the combinatorial screening library design process. 

Figure 12 shows a comparison of the volumes of space occupied by different molecules 
which are determined to be similar according to the Tanimoto 2D fingerprint descriptor but 
which are determined to be dissimilar according to the topomeric CoMFA field descriptor. 

Figure 13 shows a plot of the Tanimoto 2D pairwise similarities for a typical 
combinatorial product universe. 

Figure 14 shows the distribution of molecules resulting from a combinatorial screening 
library design plotted according to their Tanimoto 2D pairwise similarity after reactant filtering 
and after final product selection. 

Figure 15 shows the distribution of molecules plotted according to their Tanimoto 2D 
pairwise similarity of three database libraries (Chapman & Hall) from the prior art. 

Figure 16 shows a schematic representation of sets of possible reactants attached to a 
central core. 

Figure 17 is a flowchart summarizing the overall process of virtual library construction. 

Figures 18, 19, and 20 are a flowchart summarizing the overall process of applying the 
Tanimoto fingerprint metric for use in the virtual library. 

Figures 21, 22, and 23 are a flowchart summarizing the overall process of using the 
Tanimoto fingerprint metric to search for molecules. 

Figures 24, 25, and 26 are a flowchart summarizing the overall process of using both 
the topomeric CoMFA and Tanimoto metrics to search for molecules in the virtual library. 

Figures 27, 28, 29, and 30 are a flowchart summarizing the overall process for 
topomeric searches of arbitrary query molecules. 

Figure 31 shows the topomeric conformations of Tagamet and Zantac. 

Disclosure Of Invention 

1. Computational Chemistry Environment 
2. Definitions 
3. Validating Metrics 
   A. Theoretical Considerations - Neighborhood Property 
   B. Construction, Application, and Analysis Of Patterson Plots 
4. Topomeric CoMFA Descriptor 
   A. Topomeric Alignment 
      i. General Topomeric Alignment 
      ii. Specialized Alignment for Chiral and Equivalent Atoms 
   B. Calculation Of CoMFA and Hydrogen Bonding Fields 
   C. Validation Of Topomeric CoMFA Descriptor 
5. Tanimoto Fingerprint Descriptor 
   A. Neighborhood Property 
   B. Applicability Of Tanimoto To Different Biological Systems 
   C. Comparison of Sigmoid and Patterson Plots 
6. Comparison of Tanimoto and Topomeric CoMFA Metrics 
7. Additional Validation Results 
8. Combinatorial Library Design Utilizing Validated Metrics 
   A. Removal Of Reactants For Non-Diversity Reasons 
      i. General Removal Criteria 
      ii. Biologically Based Criteria 
   B. Removal of Non-Diverse Reactants 
   C. Identification (Building) Of Products 
   D. Removal Of Products For Non-Diversity Reasons 
   E. Removal of Non-Diverse Products 
9. Lead Compound Optimization 
   A. Advantages Resulting From Product Filter 
   B. Advantages Resulting From Reactant Filter 
   C. Additional Optimization Methods Using Validated Metrics 
10. Merging Libraries 
11. Other Advantages of Optimally Diverse Libraries 
12. Virtual Library Construction & Searching 
   A. Derivation of the Database (Virtual Library) of Compounds 
   B. Overview of Methodology 
   C. Overview of Virtual Library Construction 
   D. Virtual Library Construction 
      i. Representation of the Database of Compounds 
      ii. Application of A First Metric (Topomeric CoMFA) 
      iii. Application of A Second Metric (Tanimoto Fingerprint) 
      iv. Summary of Method & Scope of Chemistry 
   E. Searching the Virtual Library 
      i. Example Search Routine of Virtual Library - Tanimoto Similarity 
      ii. Design Screening Libraries (Subsets of the Virtual Library) 
         (a) Subset Screening Library Based On Topomeric Fields and Tanimoto 
         (b) Subset Based on Tanimoto Similarity 
         (c) Subset Based on Topomeric Fields 
         (d) Subset Based on Combined Metric 
      iii. Designing Lead Optimizations 
         (a) Search Based on Tanimoto Similarity 
         (b) Searches Based on Topomer Similarity 
         (c) Topomeric (3D) Searching of Arbitrary Molecular Structures 
         (d) Topomeric (3D) Searching of Core Structures 



1. Computational Chemistry Environment 

Generally, all calculations and analyses to conduct combinatorial chemistry screening 
library design and follow up are implemented in a modern computational chemistry 
environment using software designed to handle molecular structures and associated properties 
and operations. For purposes of this Application, such an environment is specifically 
referenced. In particular, the computational environment and capabilities of the SYBYL and 
UNITY software programs developed and marketed by Tripos, Inc. (St. Louis, Missouri) are 
specifically utilized. Unless otherwise noted, all software references and commands in the 
following text are references to functionalities contained in the SYBYL and UNITY software 
programs. Where a required functionality is not available in SYBYL or UNITY, the software 
code to implement that functionality is provided in an Appendix to this Application. Software 
with similar functionalities to SYBYL and UNITY is available from other sources, both 
commercial and non-commercial, well known to those in the art. A general purpose 
programmable digital computer with ample amounts of memory and hard disk storage is 
required for the implementation of this invention. In performing the methods of this invention, 
representations of thousands of molecules and molecular structures as well as other data may 
need to be stored simultaneously in the random access memory of the computer or in rapidly 
available permanent storage. The inventors use a Silicon Graphics, Inc. Challenge-M computer 
having a single 150 MHz R4400 processor with 128 MB memory and 4 GB hard disk storage 
space. As the size of the virtual library increases, a corresponding increase in hard disk storage 
and computational power is required. For these tasks, access to several gigabytes of storage 
and Silicon Graphics, Inc. processors in the R4400 to R10000 range are useful. 

2. Definitions: 

The words or phrases in capital letters shall, for the purposes of this application, have 
the meanings set forth below: 

2D MEASURES shall mean a molecular representation which does not include any 
terms which specifically incorporate information about the three dimensional features of the 
molecule. 2D is a misnomer used in the art and does not mean a geometric "two dimensional" 
descriptor such as a flat image on a piece of paper. Rather, 2D descriptors take no account of 
geometric features of a molecule but instead reflect only the properties which are derivable 
from its topology; that is, the network of atoms connected by bonds. 

2D FINGERPRINTS shall mean a 2D molecular measure in which a bit in a data string 
is set corresponding to the occurrence of a given 2-7 atom fragment in that molecule. 
Typically, strings of roughly 900 to 2400 bits are used. A particular bit may be set by many 
different fragments. 
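
As an illustration only, the following Python sketch shows how a fingerprint of this kind might be built and how two such fingerprints can be compared with the Tanimoto coefficient (bits set in both molecules divided by bits set in either). The fragment strings, the use of a CRC32 hash to pick a bit position, the 1024-bit length, and the function names are assumptions made for this example; they are not the UNITY implementation.

```python
import zlib

N_BITS = 1024  # assumed length, chosen from the "roughly 900 to 2400 bits" range


def fingerprint(fragments, n_bits=N_BITS):
    """Return the set of bit positions switched on by a molecule's fragments.

    Each fragment (here a plain string, a hypothetical encoding) is hashed to
    one bit position, so different fragments may set the same bit.
    """
    return {zlib.crc32(frag.encode("utf-8")) % n_bits for frag in fragments}


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of set bits."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return shared / union
```

For example, `tanimoto({1, 2, 3}, {2, 3, 4})` has two shared bits out of four distinct bits and therefore evaluates to 0.5.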

COMBINATORIAL SCREENING LIBRARY shall mean a subset of molecules selected 
from a combinatorially accessible universe of molecules to be used for screening in an assay. 

MOLECULAR STRUCTURAL DESCRIPTOR shall mean a quantitative representation 
of the physical and chemical properties determinative of the activity of a molecule. The term 
METRIC is synonymous with MOLECULAR STRUCTURAL DESCRIPTOR and is used 
interchangeably throughout this Application. 

PATTERSON PLOTS shall mean two dimensional scatter plots in which the distance 
between molecules in some metric is plotted on the X axis and the absolute difference in some 
biological activity for the same molecules is plotted on the Y axis. 

SIGMOID PLOTS shall mean two dimensional plots for which the proportion of 
molecular pairs in which the second molecule is also active is plotted on the Y axis and the 
pairwise Tanimoto similarity is plotted in intervals on the X axis. 
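
One possible reading of this definition can be sketched in Python as follows. The sketch assumes that each molecule carries a binary active/inactive label, that every ordered pair whose first molecule is active is counted, and that similarities are grouped into equal-width intervals; the function names and the number of intervals are hypothetical choices for illustration.

```python
from itertools import permutations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of set bits."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    return shared / union if union else 1.0


def sigmoid_table(fingerprints, active, n_bins=5):
    """For each similarity interval, return the fraction of pairs
    (first molecule active) in which the second molecule is also active.

    Intervals with no pairs are reported as None.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for i, j in permutations(range(len(fingerprints)), 2):
        if not active[i]:
            continue  # only pairs that start from an active molecule count
        s = tanimoto(fingerprints[i], fingerprints[j])
        b = min(int(s * n_bins), n_bins - 1)  # similarity 1.0 goes in the top interval
        totals[b] += 1
        hits[b] += 1 if active[j] else 0
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Plotting the returned fractions against the interval midpoints would give the Y and X values of a sigmoid plot under these assumptions.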

TOPOMERIC ALIGNMENT shall mean conformer alignment based on a set of 
alignment rules. 

3. Validating Metrics 

A. Theoretical Considerations - Neighborhood Property 

As noted above, the similarity principle suggests a way to quantify the concept of 
diversity by quantifying structural similarity. While the prior art devised many structural 
descriptors, no one has been able to explicitly show that any of the descriptors are valid. It is 
possible with the method of this invention to determine the validity of any metric by applying 
it to presently existing literature data sets, for which values of biological activity and molecular 
structure are known. Once the validity has been determined, the metric may be used with 
confidence in designing combinatorial screening libraries and in following up on discovered 
leads. Examples of these applications will be given below. 

The present invention is the first to recognize that the similarity principle also provides 
a way to validate metrics. Specifically, the similarity principle requires that any valid 
descriptor must have a "neighborhood property". That is: the descriptor must meet the 
similarity principle's constraint that it measure the chemical universe in such a way that similar 
structures (as defined by the descriptor) have substantially similar biological properties. Or, 
stated slightly differently: within some radius in descriptor space of any given molecule 
possessing some biological property, there should be a high probability that other molecules 
found within that radius will also have the same biological property. If a descriptor does not 
have the neighborhood property, it does not meet the similarity principle, and cannot be valid. 
Regardless of the computations involved or the intentions of the users, using prior art 
descriptors without the neighborhood property results, at best, in random selection of 
compounds to include in screening libraries. 
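
The "radius" formulation of the neighborhood property can be pictured with a small Python sketch. It assumes, purely for illustration, a scalar descriptor value per molecule and a binary active/inactive label; the function name and the choice of radius are hypothetical.

```python
def neighbor_hit_rate(descriptor, active, radius):
    """Fraction of neighbors of active molecules that are themselves active.

    A neighbor of molecule i is any other molecule whose descriptor value
    lies within `radius` of molecule i's value. Returns None when no active
    molecule has a neighbor.
    """
    hits = 0
    total = 0
    for i, (d_i, a_i) in enumerate(zip(descriptor, active)):
        if not a_i:
            continue
        for j, (d_j, a_j) in enumerate(zip(descriptor, active)):
            if i != j and abs(d_i - d_j) <= radius:
                total += 1
                hits += 1 if a_j else 0
    return hits / total if total else None
```

Under these assumptions, a descriptor with the neighborhood property would be expected to give a hit rate well above the overall fraction of active molecules for a suitably small radius.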

The importance of the neighborhood property to the design of combinatorial screening 
libraries is schematically illustrated in Figure 1. Figure 1A and Figure 1B show an "island" 
1 of biological activity plotted in some relevant two dimensional molecular descriptor space. 
In Figure 1A the molecules 2 of a typical prior art library are plotted as hexagons. Around 
each hexagon a circle 3 describes the area of the metric space (the neighborhood) in which 
molecules of similar structural diversity to the plotted molecule would be found. Since the 
prior art metric used to select these molecules was not valid, the molecules are essentially 
distributed at random in the metric space. The circles 3 (neighborhoods) of similar structural 
diversity of several of the molecules overlap at 4 indicating that they sample the same diversity 
space. Clearly, there is no guarantee that the island area will be adequately sampled or that 
a great deal of redundant testing will not be involved with such a library design. 

In Figure 1B the molecules 5 of an optimally designed library are plotted as stars along 
with their corresponding circles 3 of similar structural diversity. Since a valid molecular 
descriptor with the neighborhood property was used to select the molecules, molecules were 
identified which not only sampled that part of the descriptor space accessible with the 
molecular structures available but also did not sample the same descriptor space more than 
once. Clearly, the likelihood of sampling the "island" 1 is greater when it is possible to 
identify the unique neighborhood 3 around each sample molecule and choose molecules that 
sample different areas. Figure 1B represents an optimally diverse design. 

A method to quantitatively analyze whether any given metric obeys the neighborhood 
principle has been discovered. In the prior art, absolute values of biological activity have 
always been considered the dependent variable with the structural metric as the independent 
variable. This is the case for traditional QSARs (quantitative structure activity relationships). 
Note however, that the similarity principle requires that for any pair of molecules, differences 
in activity are related to differences in structure. In particular, small differences in structure 
should be associated with small differences in activity. However, the converse is not 
necessarily true; large differences in activity are not necessarily associated with large 
differences in structure. The first novel feature of the present invention is that it uses 
differences in both measures: biological differences and structural (metric) differences. There 
is no rationale present in the prior art suggesting that the use of both differences in such a 
manner would be useful. Thus, instead of looking at the values assigned by the metric to each 
molecule, the absolute differences in the metric values for each pair of molecules are the 
independent variables and the absolute differences in biological activity for each pair of 
molecules are the dependent variables. The absolute value is used since it is the difference, not 
its sign, which is important. 

For a metric possessing the neighborhood property, a scatter plot of pairwise absolute 
differences in descriptors for each set of molecules versus pairwise absolute differences in 
biological activity for the same set of molecules (Patterson plot) will have a characteristic 
appearance as shown in Figure 2. Note that it is important that pairwise absolute differences 
for all molecules in a data set are used, that is; the absolute metric "distance" between every 
molecule and every other molecule is plotted. Accordingly, there are n(n-1)/2 pairwise 
comparisons for every data set containing n compounds. The use of pairwise differences for 
every possible pair reflects all the relationships between all structural changes with all activity 
changes for the molecules under study. 
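
The construction of the points of a Patterson plot can be sketched in a few lines of Python. This is a minimal sketch assuming one scalar descriptor value and one activity value per molecule; the function name is hypothetical and the code is not taken from the Appendix.

```python
from itertools import combinations


def patterson_points(descriptor, activity):
    """Return one (|descriptor difference|, |activity difference|) point
    for every unordered pair of molecules, n(n-1)/2 points in all."""
    if len(descriptor) != len(activity):
        raise ValueError("descriptor and activity must have the same length")
    return [
        (abs(descriptor[i] - descriptor[j]), abs(activity[i] - activity[j]))
        for i, j in combinations(range(len(descriptor)), 2)
    ]
```

With three molecules having descriptor values 0.0, 1.0, 3.0 and activities 5.0, 5.5, 9.0, the function returns the 3(3-1)/2 = 3 points (1.0, 0.5), (3.0, 4.0), and (2.0, 3.5).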

Line 1 on the graph of Figure 2 depicts a special case where there is a strictly linear 
relationship between differences in metric distance and differences in biological activity. 
However, the neighborhood property does not imply a linear correlation (corresponding to 
points lying on a straight line) and need not imply anything about large property differences 
causing large biological activity differences. (Generally, the line should be linear for only very 
small changes in molecular structure and would exhibit a complex shape overall depending on 
the nature of the biological interaction. However, for purposes of discussion and analysis, it 
is useful to employ a straight line as a first approximation.) The slope of line 1 will vary 
depending on the biological activity of the measured system. Thus, the lower right trapezoid 
(LRT) {defined by the vertices [0,0], [actual metric value, max. bio. value], [max. metric 
value, max. bio. value], and [max. metric value, 0]} of the plot may be populated as shown 
in any number of ways. 

The upper left triangle (ULT) of the plot (above the line) should not be populated at 
all as long as the descriptor completely characterizes the compound and there are no 
discontinuities in the behavior of the molecules. However, in the real world, some population 
of the space (as indicated by points 2) above the line would be expected since there are known 
discontinuities in the behavior of real molecular ligands. For instance, it is well known 
amongst medicinal chemists that adding one methyl group can cause some very active 
compounds to lose all sign of activity. 

Figure 3 shows a Patterson plot of a real world example. Points lying above the solid 
line near the Y axis reflect a metric space where a small difference in metric property 
(structure) produces a large difference in biological property. These points clearly violate the 
similarity principle/neighborhood rule. Thus, in the real world sometimes relatively small 
differences in structure can produce large differences in activity. If some points lie above the 
line, the metric is less ideal but clearly still useful. The major criterion and the key point to 
recognize is that for a metric to be valid the upper left triangle will be substantially less 
populated than the lower right trapezoid. 

Thus, it should be recognized that for any receptor, the presence of some particular side 
group or combination of side groups may produce a discontinuity in the receptor response. 
Generally, however, any (metric) descriptor displaying the above characteristic of 
predominantly populating the lower right trapezoid (such as in Figure 3) will possess the 
neighborhood property, and the demonstration that a metric possesses such behavior indicates 
the validity/usefulness of that metric. Conversely, a descriptor in which the points in the 
difference plot are uniformly distributed (equal density of points in ULT and LRT) does not 
obey the neighborhood principle and is invalid as a metric. While a brief glance at the 
difference plots may quickly indicate validity or non-validity, visual analysis may be 
misleading. As it turns out, data points in the plot frequently overlap so that visually only one 
point is seen where there may be two (or more). A quantitative analysis of the data 
distribution, therefore, yields a more accurate picture. An objective validation procedure for 
determining the validity/usefulness of metrics from Patterson plots of real world data including 
a method for assessing its statistical significance is set forth below. 

Viewing the metric data in this way requires no knowledge about either the actual value 
of the biological activities or the actual values assigned by the descriptor under review. 
Because all pairwise differences are displayed, all possible gradations of molecular structural 
diversity and activity are represented and utilized. Consequently, there is no arbitrary lower 
limit set on the usable data. 

B. Construction, Application, and Analysis Of Patterson Plots 
For purposes of objectively examining metrics for validity, it is first necessary to 
accurately determine the slope (placement) of the line which divides a Patterson plot into the 
two areas, a lower right trapezoid (LRT) and an upper left triangle (ULT). The triangle is 
defined by the points [0, 0], [actual metric value, max. bio. value], and [0, max. bio. value]. 
The trapezoid is defined by the points [0,0], [actual metric value, max. bio. value], [max. 
metric value, max. bio. value], and [max. metric value, 0]. For a metric to be a valid and 
useful measure of molecular diversity, the density of points in the lower right trapezoid should 
be significantly greater than the density in the upper left triangle. To determine the correct 
placement of the line, the variation in the density of points is used. The line must always pass 
through (0,0) at the lower left corner of the Patterson plot since no change in any metric must 
imply no change in the biological activity. As noted earlier, considering a straight line is only 
a first approximation. A "perfect" metric, which totally describes the structure activity 
relationship of the biological system, would display a complex line reflecting the biological 
interaction. As a first approximation, a "useful" straight line can be found which meaningfully 
reflects the variation in the density of points. 
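
The geometry of the two regions can be made concrete with a short Python sketch that, for a given line placement, computes the point density (points per unit graph area) in the upper left triangle and in the lower right trapezoid. It is a sketch under stated assumptions and not the code of Appendix "A": the function name is hypothetical, `x_a` stands for the "actual metric value" at which the line meets the top of the plot, the plot bounds are taken as the largest observed differences, and points lying exactly on the line are assigned to the trapezoid.

```python
def region_densities(points, x_a):
    """Return (ULT density, LRT density) for the line from (0, 0) to (x_a, max_y).

    `points` is a list of (metric difference, activity difference) pairs.
    """
    max_x = max(x for x, _ in points)
    max_y = max(y for _, y in points)
    # Triangle (0,0), (x_a,max_y), (0,max_y); the trapezoid is the rest of the plot.
    ult_area = 0.5 * x_a * max_y
    lrt_area = max_x * max_y - ult_area
    # A point is above the line y = (max_y / x_a) * x when y * x_a > max_y * x.
    n_ult = sum(1 for x, y in points if y * x_a > max_y * x)
    n_lrt = len(points) - n_ult
    return n_ult / ult_area, n_lrt / lrt_area
```

For a valid metric, the second value returned would be expected to be markedly larger than the first.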

The preferred search for the correct/useful line tests only those slopes which a 
particular data set can distinguish; specifically those drawn from [0,0] to each point [actual 
metric value, max bio value]. The process starts by drawing the line to a point having the 
smallest actual metric value [smallest metric value, max. bio. value] and continues for all of 
the values observed for actual metric value up to the largest [largest metric value, max. bio. 
value]; i.e., subsequent lines are of decreasing slope. (In the limiting case of drawing the line 
to [largest metric value, max. bio. value] the trapezoid becomes a triangle.) When searching 
for the correct diagonal, it is defined to be the one which yields the highest density (number 
of data points/unit graph area) for a lower right triangle, which for this process is defined to 
have its vertices at [0, 0], [actual metric value, 0], and [actual metric value, max bio. value]. 
Thus, the line is identified based on the density of points under this triangle, but the evaluation 
ratios for the metric are calculated based on the density within the trapezoid compared to the 
density of the entire plot (sum of triangle and trapezoid areas). The software necessary to 
implement this procedure (as well as to determine the values to be discussed below) is 
5 contained in Appendix "A". There may be other procedures for determining the placement of 
the line since the line is only a first approximation. Any such procedure must meet two tests: 
1) it must consistently distinguish between diversity descriptors; and 2) it must clearly 
distinguish/recognize meaningless diversity descriptors. The procedure described here clearly 
meets both tests. (The preferred search for the placement of the line is as described above.
However, the lines shown in the Figures accompanying this description were found slightly
differently. For the Figures, the search was started by requiring that the diagonal also pass
through the point defined by the largest descriptor difference and the maximum biological
activity difference [max. metric value, max. bio. value]. The line was then systematically tilted
towards the vertical trying each of 100 evenly spaced steps (in terms of the Y/X ratio). As in
the preferred method, the line yielding the highest density for the LRT was drawn. The line
placements yielded by the two methods are not substantially different. All numerical values 
reported in this specification were obtained from Patterson plots in which the preferred line 
drawing process was used.) 
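
As a rough illustration of the preferred line search, the following Python sketch tests one candidate line per observed metric difference, keeps the one whose lower right triangle is densest, and then computes the density ratio for the trapezoid under that line. It is not the software of Appendix "A". The function names (`find_diagonal`, `density_ratio`), the representation of the Patterson plot as a list of (metric difference, biological difference) pairs, and the tie-breaking in favor of the steeper line are all assumptions made here for illustration.

```python
def find_diagonal(points):
    """Return (x_hit, density) for the line from (0, 0) to (x_hit, max bio
    value) whose lower right triangle has the highest point density.

    points: list of (metric_difference, bio_difference) pairs, all >= 0,
    with at least one nonzero value on each axis.
    """
    y_max = max(y for _, y in points)
    best = None
    # One candidate per distinct observed metric difference, smallest first,
    # so successive candidate lines have decreasing slope.
    for x_hit in sorted({x for x, _ in points if x > 0}):
        slope = y_max / x_hit
        # Triangle with vertices (0, 0), (x_hit, 0), (x_hit, y_max).
        count = sum(1 for x, y in points if x <= x_hit and y <= slope * x)
        density = count / (0.5 * x_hit * y_max)
        if best is None or density > best[1]:  # ties keep the steeper line
            best = (x_hit, density)
    return best


def density_ratio(points, x_hit):
    """Density under the line (trapezoid) divided by the average density."""
    x_max = max(x for x, _ in points)
    y_max = max(y for _, y in points)
    slope = y_max / x_hit
    total_area = x_max * y_max
    trapezoid_area = total_area - 0.5 * x_hit * y_max
    in_trapezoid = sum(1 for x, y in points if y <= slope * x)
    return (in_trapezoid / trapezoid_area) / (len(points) / total_area)


# Hypothetical pairwise differences, chosen so that every point lies on or
# below the diagonal of the plot.
points = [(1, 0.5), (2, 1), (3, 1), (4, 2), (4, 4)]
x_hit, density = find_diagonal(points)
ratio = density_ratio(points, x_hit)
```

In this toy data set the best line runs to the far corner of the plot, so the trapezoid has collapsed into a triangle holding every point and the ratio takes its limiting value.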

The Patterson plot showing the diagonal for an exemplary data set used to validate the
topomeric CoMFA descriptor (discussed in Section 4.C. below) is shown in Figure 3. For
comparison, Figures 4 and 5 show Patterson plots for two other variations of the same data
which would not be expected to be valid molecular "measurements" useful as diversity metrics.
For Figure 4, in place of the actual metric values of Figure 3, random numbers were generated
for the diversity descriptor values of each compound and the Patterson plot generated from the
differences in these random numbers. As expected from a random number assignment, no line
can be found by the procedure which enriches the density in the triangle and the best ratio is 
not significantly different from 1.0. The best line is always reported by the procedure, which 
in this case corresponds to a nearly vertical line drawn to the point [minimum metric value, 
max. bio. value]. For randomly distributed values, this line yields the highest density for the
test triangle since the X axis value and, therefore, the area of the tested triangle, is at a
minimum. It is possible with some random data sets that this line, although nearly vertical,
might include a couple points under the line. The placement of the line at this position is
essentially an artifact of the procedure which results from an inability to find any other line
which enriches the density in the tested triangle.

Because random numbers are not "real" metrics, an example of a "real molecular 
measurement" that is unlikely to be a valid diversity metric was examined. For the Patterson 
plot of Figure 5, a force field strain energy (for the topomeric conformations using the 
standard Tripos force field) was calculated for each of the compounds in the same data set as
was used for Figures 3 and 4. Because force field strain energy tends to increase with the 
number of atoms and thus, correlate roughly with the occasionally useful molecular weight, 
to normalize the value, the force field energy was divided by the number of atoms in each 
molecule. As expected, just as with random numbers, no optimum line could be found. This 
is essentially a confirmation that the points in the graph were also distributed randomly. Again,
the best ratio is not significantly different from 1.0. 

To objectively quantify the validity/usefulness determination, the ratio of the density 
of points in the lower right trapezoid to the average density of points is determined. This value 
can vary from somewhere above 0 but significantly less than 1, through 1 (equal density of 
points in each area) to a maximum of 2 (all the points in the lower right trapezoid, and the
upper triangle and lower trapezoid are equal in area [limiting case of trapezoid merging into
triangle]). According to the theoretical considerations discussed above, a ratio very near or
equal to 1 (approximately equal densities) would indicate an invalid metric, while a ratio 
(significantly) greater than 1 would indicate a valid metric. The value of this ratio is set forth 
next to each Patterson plot in Figures 3 (real data), 4 (random numbers substituted), and 5
(force field energy substituted) under the column "Density Ratio". Clearly, the topomeric
CoMFA data of Figure 3 reflect a valid metric (ratio much larger than 1), while the random 
numbers of Figure 4 and force field energies of Figure 5 reflect a meaningless invalid metric 
(ratio very near 1). As will be discussed below, a density ratio of 1.1 is a useful threshold of 
validity/usefulness for a molecular diversity descriptor.
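
The upper limit of 2 follows from the geometry of the plot. As a short worked derivation (the symbols are introduced here for illustration only), let X and Y be the largest metric and biological differences, x the metric value at which the line reaches the top of the plot, N the total number of points, and n the number of points in the lower right trapezoid:

```latex
\[
A_{\text{total}} = XY, \qquad
A_{\text{trap}} = XY - \tfrac{1}{2}\,xY, \qquad
R = \frac{n / A_{\text{trap}}}{N / A_{\text{total}}}
  = \frac{n}{N}\cdot\frac{X}{X - x/2}.
\]
\[
\text{All points under the line } (n = N) \text{ and } x = X
\;\Longrightarrow\; R = \frac{X}{X/2} = 2 ;
\qquad
\text{uniformly scattered points } \bigl(n/N \approx A_{\text{trap}}/A_{\text{total}}\bigr)
\;\Longrightarrow\; R \approx 1 .
\]
```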

The statistical significance of the Patterson plot data can also be determined by a
chi-squared test at any chosen level of significance. In this case the data are handled as:

\[ \chi^2 = \frac{(\text{Actual LRT Count} - \text{Expected LRT Count})^2}{\text{Expected LRT Count}} \]

where:

\[ \text{Expected LRT Count} = \frac{\text{LRT Area}}{\text{Total Area}} \times \text{Total Count} \]

The chi-squared values for the Patterson plots of Figures 3, 4, and 5 are also set forth next to
the plots under the column "χ²". For 95% confidence limits and one degree of freedom, the
chi-squared value is 3.84. The chi-squared values confirm the visual inspection and density ratio
observations that the CoMFA metric is valid and the other two "constructed" metrics are 
invalid. A full set of topomeric CoMFA, random number, and force field data are discussed
below under validation of the topomeric CoMFA descriptor. 
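
The chi-squared check can be sketched in a few lines of Python. The sketch assumes that the expected count is the total number of points scaled by the fraction of the plot area lying under the line (that is, a uniform scatter of points as the null hypothesis); the function names and the example counts are hypothetical and do not come from the Figures.

```python
CHI2_CRITICAL_95_1DF = 3.84  # value quoted in the text: 95% limits, 1 degree of freedom


def chi_squared(actual_count, total_count, area_fraction):
    """Chi-squared statistic for the number of points found under the line.

    Assumption: expected count = total count x (area under line / total area).
    """
    expected = total_count * area_fraction
    return (actual_count - expected) ** 2 / expected


def is_significant(chi2, critical=CHI2_CRITICAL_95_1DF):
    return chi2 > critical


# Hypothetical counts: 100 points, half of the plot area under the line.
chi2_enriched = chi_squared(80, 100, 0.5)   # strong enrichment under the line
chi2_random = chi_squared(52, 100, 0.5)     # essentially uniform scatter
```

With these made-up counts, only the first case exceeds the 3.84 threshold.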

The analysis of metrics using the difference plot of this invention is a powerful tool 
with which to examine metrics and data sets. First, the analysis can be used with any system 
and requires no prior assumptions about the range of activities or structures which need to be 
considered. Second, the plot extracts all the information available from a given data set since
pairwise differences between all molecules are used. The prior art believed that not much 
information, if any, could be extracted from literature data sets since, generally, there is not 
a great deal of structural variety in each set. On the contrary, as will be shown below, using 
the Patterson plot method of this invention, a metric can be validated based on just such a 
limited data set. As will also be demonstrated below, metrics can be applied to literature data
sets to determine the validity of the metrics. This ability opens up vast amounts of pre-existing 
literature data for analysis. Since in any analysis there is always a risk of making an improper 
determination due to sampling error when too few data sets are used or too narrow a variety
of biological systems (activities) are included, the ability to use much of the available literature
is a significant advance in the art. Also, the fact that the validation analysis methodology of
this invention is not dependent on the study of a specific biological system, strongly implies 
that a validated metric is very likely to be applicable to molecular structures of unknown 
biological activity encountered in designing combinatorial screening libraries or making other 
diversity based selections. Or stated slightly differently, there is a high degree of confidence 
that metrics validated across many chemistries and biologies can be used in situations where
nothing is known about the biological system under study. 
4. Topomeric CoMFA Descriptor

Many of the prior art descriptors are essentially 2D in nature. That this is the case with
the prior art probably reflects three underlying reasons. First, the rough general associations
between fragments and biological properties were validated statistically decades ago. Second,
2D fragment keys or "fingerprints" are widely available since they are used by all commercial
molecular database programs to compare structures and expedite retrieval. Third, no one in
the prior art has yet met the challenge of figuring out how to formulate and validate an
appropriate three dimensional molecular structural descriptor. The situation in the prior art
before the present invention is very similar to the field of QSAR about ten years ago. Then,
the prior art had long recognized the desirability of three dimensional descriptors but had not
been able to implement any. When a 3D technique (CoMFA) became available, its widespread
acceptance and application confirmed the expected importance of 3D descriptors in general.

It has been discovered that a CoMFA approach to generating a molecular structural 
descriptor using a specially developed alignment procedure, topomeric alignment, produces a 
three dimensional descriptor of molecules which is shown to be valid by the method outlined 
above. In addition, this new descriptor provides a powerful tool with which to design 
combinatorial screening libraries. It is equally useful any time selection based on diversity
from within a congeneric series is required. A full description of CoMFA and the generation 
of molecular interaction energies is contained in U.S. Patents 5,025,388 and 5,307,287. The 
disclosures of these patents are incorporated in this Application. The usual challenge in 
applying CoMFA to a known set of molecules is to determine the proper alignment of the 
molecular structures with respect to each other. Two molecules of identical structure will have
substantially different molecular interaction energies if they are translated or rotated so as to
move their atoms more than about 4 Å from their original positions. Thus, alignment is hard
enough when applying CoMFA to analyze a set of molecules which interact with the same 
biological receptor. The more difficult question is how to "align" molecules distributed in 
multidimensional chemistry space to create a meaningful descriptor with respect to arbitrary
and unknown receptors against which the molecules will ultimately be tested. The topomeric 
alignment procedure was developed to correct the usual CoMFA alignments which often over- 
emphasize a search for "receptor-bound", "minimum energy", or "field-fit" conformations. It 
has been discovered that, when congenericity exists, a meaningful alignment results from 
overlaying the atoms that lie within some selected common substructure and arranging the
other atoms according to a unique canonical rule with any resulting steric collisions ignored. 
When CoMFA fields are generated for molecules so aligned, it has been discovered that the 
resulting field differences are a valid molecular structural descriptor. 

Two major advantages are achieved by applying the topomeric CoMFA metric to the 
reactants proposed for use in a combinatorial synthesis rather than the products resulting from
the synthesis. First, the computational time/effort is dramatically reduced. Instead of analyzing
for diversity a combinatorial matrix of product compounds (R1 × R2 × R3 ...), only the
values for the sum of the reactants (R1 + R2 + R3 ...) need to be computed. For example,
assuming 2000 reactants for R1 and 2000 reactants for R2, only 4000 calculations need be
performed on the reactants versus 2000² (4,000,000) if calculations on the combinatorial
products were performed. Second, by identifying reactants which explore similar diversity
space, it is only necessary to choose one of each reactant representative of each diversity. This
immediately reduces the number of combinatorial products which need to be considered and
synthesized.
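
The saving described above is simple arithmetic, sketched here in Python. The function name `descriptor_calculations` is hypothetical; the reactant counts are the ones used in the example in the text.

```python
import math


def descriptor_calculations(reactant_counts):
    """Return (calculations on reactants, calculations on products).

    reactant_counts: number of candidate reactants at each variable position.
    """
    on_reactants = sum(reactant_counts)       # R1 + R2 + R3 ...
    on_products = math.prod(reactant_counts)  # R1 x R2 x R3 ...
    return on_reactants, on_products


on_reactants, on_products = descriptor_calculations([2000, 2000])
```

Adding a third position with 2000 reactants would raise the reactant-side work to 6000 calculations, while the product-side work would grow by a further factor of 2000.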

A. Topomeric Alignment

Usually a CoMFA modeler seeks low energy conformations. However, if alignment
with unknown receptors is desired (such as is the case in designing combinatorial screening
libraries for general purpose screening), then the major goal in conformer generation must be
that molecules having similar topologies should produce similar fields. In fact, topomeric
CoMFA fields may be used as a validated diversity descriptor to identify molecules with
similar or dissimilar structures anytime there is a problem of having more compounds than can
be easily dealt with. Thus, its applicability extends well beyond its use in combinatorial
chemistry to all situations where it is necessary to analyze an existing group of compounds or
specify the creation of new ones. The topomeric alignment procedure is especially applicable
to the design of a combinatorial screening library. Typically, as noted earlier, in the creation
of combinatorially derived compounds, there is often an invariant central core to which a
variety of side chains (contributed by reactants of a particular class) are attached at the open
valences. Within the combinatorial products, this central core tethers each of the side chains
contributed by any set of reactants into the same relative position in space. In the language of
CoMFA alignments, the side chains contributed by each reactant can thus be oriented by
overlapping the bond that attaches the side chain to the central core and using a topomeric
protocol to select a representative conformation of the side chain. Nowhere does the prior art
suggest that a topomeric protocol could possibly yield a meaningful alignment. Indeed, the
prior art inherently teaches away from the idea because the topomerically derived conformers
often may be energetically inaccessible and incapable of binding to any receptor.

The idea of a topomeric conformer is that it is rule based. The exact rules may be
modified for specific circumstances. In fact, once it is appreciated from the teaching of this
invention that a particular topomeric protocol is useful (yields a valid molecular descriptor),
other such protocols may be designed and their use is considered within the teaching of this
disclosure.

i. General Topomeric Alignment
With the exception of two specialized situations (molecules containing chiral atoms or
requiring a choice between two equivalent atoms) which will be discussed in section 4(A)(ii)
below, the following topologically based rules will generate a single, consistent, unambiguous,
aligned topomeric conformation for any molecule. The software necessary to implement this
procedure is contained in Appendix "A". The starting point for a topomeric alignment of a
molecule is a CONCORD generated three dimensional model which is then FIT as a rigid body
onto a template 3D model by least-squares minimization of the distances between structurally
corresponding atoms. By convention, the template model is originally oriented so that one of
its atoms is at the Cartesian origin, a second lies along the X axis, and a third lies in the XY
plane.

Torsions are then adjusted for all bonds which: 1) are single and acyclic; 2) connect
polyvalent atoms; and 3) do not connect atoms that are polyvalent within the template model
structure since adjusting such bonds would change the template-matching geometry.
Unambiguous specification of a torsion angle about a bond also requires a direction along that
bond and two attached atoms. In this situation, for acyclic bonds the direction "away from the
FIT atoms" is always well-defined.

The following precedence rules then determine the two attached atoms. From each
candidate atom, begin growing a "path", atom layer by atom layer, including all branches but
ending whenever another path is encountered (occurrence of ring closure). At the end of the
bond that is closer to the FIT atoms, choose the attached atom beginning the shortest path to
any FIT atom. If there are several ways to choose the atom, first choose the atom with the
lowest X. If there are still several ways to choose the atom, choose next the atom with the
lowest Y and finally, if necessary, the lowest Z coordinate (coordinate values differing by
some small value, typically less than 0.1 Angstroms, are considered as identical). At the other
end of the bond, choose the atom beginning the path that contains any ring. When more than
one path contains a ring, choose the atom whose path has the most atoms. If there are several
ways to choose the path, in precedence order choose the path with the highest sum of atomic
weights, and finally, if still necessary, the atom with the highest X, then highest Y, then
highest Z coordinate. The new setting of the torsional value depends only on whether the
bonds to the chosen atoms are cyclic or not. If neither are cyclic, the setting is 180 degrees;
if one is cyclic, the setting is 90 degrees; and if both are cyclic, the setting is 60 degrees. Any
steric clashes that may result from these settings are ignored.

As an illustrative example, consider generation of the topomeric conformer for the side
chain shown in Figure 6(A), in which atom 1 is attached to some core structure by the upper
leftmost bond. Assuming that the alignment template for this fragment involves atom 1 only,
there are three bonds whose torsions require adjustment, those connecting atom pairs 1 - 3;
5 - 8; and 10 - 14. (Adding atom 3 to the alignment template would make atom 1 "polyvalent
within the template model structure", so that the 1 - 3 bond would then not be altered.) The
atom whose attached atoms will move (in the torsion adjustment) is the second atom noted in
each atom pair. For example, if a torsional change were applied to the 14 - 10 bond instead
of the 10 - 14 bond as shown in Figure 6(A), all of the molecule except atoms 10, 14 and 15
(and 13 by symmetry) would move. Correspondingly, if a torsional change were applied to the
10 - 14 bond instead of the 14 - 10 bond, only atom 15 would move.

To define a torsional change, atoms attached to each of the bonded atoms must also be
specified. For example, setting torsion about the bond 5 - 8 to 60 degrees would yield four
different conformers depending on whether it is the 6-5-8-13, 6-5-8-9, 4-5-8-9, or 4-5-8-13
dihedral angle which becomes 60 degrees. To make such a choice, "paths" are grown from
each of the candidate atoms, in "layers", each layer consisting of all previously unvisited atoms
attached to any existing atom in any path. In choosing among the four attached-atom
possibilities of the 5 - 8 bond, Figure 6(B) shows the four paths after the first layer of each
is grown, and Figure 6(C) shows the final paths. In Figure 6(C), notice within the rings that
not only is the bond between 3 and 7 not crossed, but also atom 11 is not visited because the
third layer seeks to include 11 from two paths, so both fail. The attached atoms chosen for
the torsion definition become the ones that begin the highest-ranking paths according to the
rules stated above. For example, in Figure 6(C), attached atom 4 outranks atom 6 because its
path is the only one reaching the alignment template, and atom 9 outranks atom 13 because
its path has more atoms, so that it is the 4-5-8-9 torsion which is set to a prescribed value.
For the same reasons, the other complete torsions become 9-10-14-15, attached-1-3-4 and
attached-1-2-16. The other decision rules would need to be applied if atom 9 were, instead of
carbon, an aromatic nitrogen (with the consequent loss of the attached hydrogen) so that the
9 and 13 paths have the same number of atoms. In this case, the 9 path still takes priority,
since it has the higher molecular weight. If instead atom 14 is deleted, so that the 9 and 13
paths are topologically identical, the 9 path again takes priority because atom 9 has the same
X coordinate but a larger Y coordinate than does atom 13.
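
The layer-by-layer growth of paths can be sketched as a breadth-first search in which all paths advance together. The Python below is only an illustration of that idea, not the Appendix "A" software. The function name `grow_paths`, the adjacency-dictionary representation, and the small atom graph are hypothetical (the graph is not the side chain of Figure 6), and the handling of an atom sought by two paths in the same layer (it joins neither and is closed to later layers) is an assumption drawn from the remark about atom 11.

```python
def grow_paths(bonds, seeds, excluded=()):
    """Grow one path per seed atom, one layer at a time.

    bonds:    dict mapping each atom label to the set of atoms bonded to it
    seeds:    candidate attached atoms; one path is grown from each
    excluded: atoms no path may enter (e.g. the atoms of the bond whose
              torsion is being defined)
    """
    paths = {s: {s} for s in seeds}
    frontier = {s: {s} for s in seeds}
    closed = set(seeds) | set(excluded)
    while any(frontier.values()):
        # For this layer, record which paths reach each still-open atom.
        claims = {}
        for seed, atoms in frontier.items():
            for atom in atoms:
                for neighbor in bonds[atom]:
                    if neighbor not in closed:
                        claims.setdefault(neighbor, set()).add(seed)
        frontier = {s: set() for s in seeds}
        for atom, claimants in claims.items():
            closed.add(atom)
            # Assumption: an atom sought by two paths at once joins neither.
            if len(claimants) == 1:
                (seed,) = claimants
                paths[seed].add(atom)
                frontier[seed].add(atom)
    return paths


# Hypothetical graph: atom 0 carries candidate atoms 1 and 2; atoms
# 0-1-3-6-5-2 form a ring, and atom 4 is a substituent on atom 1.
bonds = {0: {1, 2}, 1: {0, 3, 4}, 2: {0, 5}, 3: {1, 6},
         4: {1}, 5: {2, 6}, 6: {3, 5}}
paths = grow_paths(bonds, seeds=[1, 2], excluded=[0])
largest = max(paths, key=lambda seed: len(paths[seed]))
```

Here atom 6 is reached by both paths in the second layer and so belongs to neither; the path from atom 1 is the larger one because of its extra substituent.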

As for the dihedral angle values themselves, torsion 4-5-8-9 is set to 60 degrees
because both the 4-5 and 8-9 bonds are within a ring; torsions 9-10-14-15 and attached-1-3-4
become 90°, because only the 9-10 and 3-4 bonds respectively are cyclic; and the
attached-1-2-16 dihedral becomes 180° since none of the bonds are cyclic. It should be noted that this
topomeric alignment procedure will not work with molecules containing chiral centers since,
for each chiral center, two possible three dimensional configurations are possible for the same
molecule, and, clearly, each configuration by the above rules would yield a different topomeric
conformer.
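
The rule for the torsion setting itself is a three-way lookup on how many of the two bonds to the chosen attached atoms are cyclic. A minimal Python sketch follows; the function name `topomeric_torsion` is hypothetical, and the three calls reproduce the worked values quoted for Figure 6.

```python
def topomeric_torsion(first_bond_cyclic, second_bond_cyclic):
    """Torsion setting in degrees, given whether each of the two bonds to
    the chosen attached atoms lies in a ring."""
    n_cyclic = int(first_bond_cyclic) + int(second_bond_cyclic)
    return {0: 180.0, 1: 90.0, 2: 60.0}[n_cyclic]


torsion_4_5_8_9 = topomeric_torsion(True, True)       # 4-5 and 8-9 both cyclic
torsion_9_10_14_15 = topomeric_torsion(True, False)   # only 9-10 cyclic
torsion_att_1_2_16 = topomeric_torsion(False, False)  # neither bond cyclic
```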

ii. Specialized Alignment for Chiral and Equivalent Atoms
In order to resolve the ambiguity introduced by a chiral center or centers in a molecule,
a specialized topomeric alignment rule must be adopted. Figure 6(D) shows a side chain whose
attachment atom is marked as "Root" and in which atom I is chiral. Atom I has four
non-equivalent attachments, indicated by Root, J, K, and L. Although the absolute
configuration of such a chiral atom is not usually specified, an alignment methodology of an
explicit 3D model must necessarily consistently select one of the two possible conformations,
even if arbitrarily chosen. Proceeding as taught above, generating the topomeric conformation
for the side chain leads to selection of atom J (the largest of the attachments rooted by J, K,
and L) as the atom defining the Root-I torsion and thus fixes the position of J. However, the
relative positions of K and L remain ambiguous. Unless such "prochiral" atoms (including
pyramidally hybridized nitrogen) are recognized and a configuration explicitly assigned, side
chains which are topologically identical may seem to be very different in shape.

The procedure used to make sure that the actual topomeric 3D models generated around
chiral centers are as similar as possible is as follows: first, form a list of all such chiral centers
including pyramidal nitrogen (many algorithms for doing this are described in the literature and
are found in any modelling software); second, after an individual torsion has been set, as
described earlier, if the third atom of the four in the torsion list is one of the chiral centers
[in Figure 6(D) the configuration of atom I will be adjusted just after the torsion about Root-I
has been set], proceed to replace the fourth atom on the torsion list [J in Figure 6(D)] with the
next highest attachment atom [following the earlier description this will be atom K in Figure
6(D)]. If the dihedral angle value for the new torsion is greater than 180 degrees, then the
relative position of atoms K and L must be exchanged. To exchange the positions of atoms K
and L, generate the plane defined by the second (Root) through fourth (J) atoms on the torsion
that was initially set. Finally, reflect the coordinates of all the atoms attached to the third atom
(I) through that plane. This topomeric procedure will generate a consistent topomeric
alignment for all side chains containing chiral centers.
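
The exchange of K and L is a mirror reflection through the plane containing Root, I and J. The following Python sketch shows only that geometric step, using plain tuples for coordinates; the function name `reflect_through_plane` and the example coordinates are hypothetical, and the three plane-defining atoms are assumed not to be collinear.

```python
def reflect_through_plane(point, a, b, c):
    """Reflect `point` through the plane containing points a, b and c.

    a, b and c must not be collinear (otherwise the normal has zero length).
    """
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    # Plane normal = ab x ac (not normalized).
    normal = [ab[1] * ac[2] - ab[2] * ac[1],
              ab[2] * ac[0] - ab[0] * ac[2],
              ab[0] * ac[1] - ab[1] * ac[0]]
    norm_sq = sum(n * n for n in normal)
    # Signed distance of the point from the plane, in units of the normal.
    offset = sum((point[i] - a[i]) * normal[i] for i in range(3)) / norm_sq
    return tuple(point[i] - 2.0 * offset * normal[i] for i in range(3))


# Hypothetical coordinates: the three plane-defining atoms lie in the XY plane.
root, atom_i, atom_j = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
atom_k = (1.0, 2.0, 3.0)
atom_k_reflected = reflect_through_plane(atom_k, root, atom_i, atom_j)
```

Atoms lying in the plane (Root, I and J themselves) are left where they are, so only the out-of-plane attachments change sides.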

A second specialized topomeric alignment problem which may be encountered is the
requirement to select between two equivalent atoms. This situation is also illustrated in Figure
6(D) where there are two candidate attachment atoms, "A" and "a", for the torsion
A(a)-B-C-D. Topologically atoms "A" and "a" are identical, but a different position for the
five-membered ring, hence a very different shape, will be generated depending on whether "A"
or "a" is used to assign the torsion of A(a)-B-C-D. The following rule is used to ensure that
the choice between "A" and "a" is made consistently. Measure the two dihedral angles defined
by the atom lists Root-B-C-A and Root-B-C-a. (Although these atoms are obviously not
directly connected, the dihedral angle values are well-defined.) Of the two possibilities, select
the atom to define the torsion for which the torsional value lies between 170 and 350 degrees.
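
A sketch of this selection in Python is given below. It computes each dihedral with the usual projection and `atan2` construction, reports it on a 0 to 360 degree scale, and keeps the candidate whose value falls in the 170 to 350 degree window. The sign convention of the dihedral, the function names, and the coordinates are assumptions made for illustration; the text does not specify the handedness used by the original software.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Dihedral angle p0-p1-p2-p3 in degrees, reported in [0, 360)."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = (b1[0] / length, b1[1] / length, b1[2] / length)
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    angle = math.atan2(_dot(_cross(axis, v), w), _dot(v, w))
    return math.degrees(angle) % 360.0


def choose_equivalent(root, b, c, candidates, low=170.0, high=350.0):
    """Return labels of candidates whose Root-B-C-X dihedral lies in [low, high]."""
    return [label for label, xyz in candidates.items()
            if low <= dihedral_degrees(root, b, c, xyz) <= high]


# Hypothetical geometry: B-C along Z, the two equivalent atoms on opposite sides.
root, atom_b, atom_c = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
candidates = {"A": (0.0, 1.0, 1.0), "a": (0.0, -1.0, 1.0)}
chosen = choose_equivalent(root, atom_b, atom_c, candidates)
```

In this made-up geometry the two candidates give dihedrals 180 degrees apart, so only one of them falls inside the window.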

Using the selection rules set out above, the critical point is that the use of a single 
topomerically aligned conformer in computing a CoMFA three dimensional descriptor has been 
found to yield a validated descriptor. While other approaches to conformer selection such as 
averaging many representative conformers or classifying a representative set by their possible
interactions with a theoretically averaged receptor (such as in the polyomino docking) are 
possible, it has been found that topomerically aligned conformers yield a validated descriptor 
which, as will be seen below, produces clustering highly consistent with the accumulated 
wisdom of medicinal chemistry. 
B. Calculation Of CoMFA and Hydrogen Bonding Fields

The basic CoMFA methodology provides for the calculation of both steric and 
electrostatic fields. It has been found up to the present point in time that using only the steric 
fields yields a better diversity descriptor than a combination of steric and electrostatic fields. 
There appear to be three factors responsible for this observation. First is the fact that steric 
interactions - classical bioisosterism - are certainly the best defined and probably the most
important of the selective non-covalent interactions responsible for biological activity. Second, 
adding the electrostatic interaction energies may not add much more information since the 
differences in electrostatic fields are not independent of the differences in steric fields. Third, 
the addition of the electrostatic fields will halve the contribution of the steric field to the 
differences between one shape and another. This will dilute out the steric contribution and also
dilute the neighborhood property. Clearly, reducing the importance of a primary descriptor is
not a way to increase accuracy. However, it is certainly possible that in a given special
situation the electrostatic contribution might contribute significantly to the overall "shape".
Under these unique circumstances, it would be appropriate to also use the electrostatic
interaction energies or other molecular characterizers, and such are considered within the scope 
of this disclosure. For instance, in some circumstances a topomeric CoMFA field which 
incorporates hydrogen bonding interactions, characterized as set forth below, may be useful. 

The steric fields of the topomerically aligned molecular side chain reactants are
generated almost exactly as in a standard CoMFA analysis using an sp³ carbon atom as the
probe. As in standard CoMFA, both the grid spacing and the size of the lattice space for which 
data points are calculated will depend on the size of the molecule and the resolution desired. 
The steric fields are set at a cutoff value (maximum value) as in standard CoMFA for lattice 
points whose total steric interaction with any side-chain atom(s) is greater than the cutoff
value. One difference from the usual CoMFA procedure is that atoms which are separated 
from any template-matching atom by one or more rotatable bonds are set to make reduced 
contributions to the overall steric field. An attenuation factor (1 - "small number"), preferably 
about 0.85, is applied to the steric field contributions which result from these atoms. For atoms 
at the end of a long molecule, the attenuation factor produces very small field contributions
(i.e., [0.85]^N), where N is the number of rotatable bonds between the specified atom and the
alignment template atom. This attenuation factor is applied in recognition of the fact that the 
rotation of the atoms provides for a flexibility of the molecule which permits the parts of the 
molecule furthest away from the point of attachment to assume whatever orientation may be 
imposed by the unknown receptor. If such atoms were weighted equally, the contributions to
the fields of the significant steric differences due to the more anchored atoms (whose 
disposition in the volume defined by the receptor site is most critical) would be overshadowed 
by the effects of these flexible atoms. 
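
The attenuation can be illustrated with a short Python sketch. It scales each atom's steric contribution by 0.85 raised to the number of intervening rotatable bonds and then applies the cutoff to the lattice-point total. The function names, the example energies, and the cutoff value are hypothetical, and applying the attenuation before the cutoff is an assumption; the text does not state the order of the two operations.

```python
def attenuation(n_rotatable_bonds, factor=0.85):
    """Weight applied to an atom separated from the template-matching atoms
    by n rotatable bonds."""
    return factor ** n_rotatable_bonds


def lattice_point_steric(contributions, cutoff, factor=0.85):
    """Steric field value at one lattice point.

    contributions: list of (energy, n_rotatable_bonds) pairs, one per atom.
    cutoff:        maximum field value allowed at any lattice point.
    Assumption: contributions are attenuated first, then the total is capped.
    """
    total = sum(energy * attenuation(n, factor) for energy, n in contributions)
    return min(total, cutoff)


# Two hypothetical atoms contributing equal raw energy; the second sits two
# rotatable bonds away from the template and therefore counts for less.
field_value = lattice_point_steric([(10.0, 0), (10.0, 2)], cutoff=30.0)
capped_value = lattice_point_steric([(10.0, 0), (10.0, 2)], cutoff=15.0)
```

An atom ten rotatable bonds out would be weighted by about 0.2, which is how the flexible ends of long side chains are kept from dominating the field.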

The derivation of a hydrogen-bond field is slightly different from the standard CoMFA 
measurement. The intent of the hydrogen-bonding descriptor is to characterize similarities and
differences in the abilities of side chains to form hydrogen-bonds with unknown receptors. 
Like the successful use of the topomeric conformation to characterize steric interactions, the 
topomeric conformation is also an appropriate way to characterize the spatial position of a side 
chain's hydrogen-bonding groups. However, unlike a steric field, hydrogen-bonding is a 
spatially localized phenomenon whose strength is also difficult to quantitate. Therefore, it is
appropriate to represent a hydrogen-bonding field as a bitset, much like a 2D fingerprint, or 
as an array of 0 or 1 values rather than as an array of real numbers like a CoMFA field. 

The hydrogen-bonding loci for a particular side chain are specified using the DISCO
approach of "extension points" developed by Y. Martin and coworkers, wherein, for
example, a carbonyl oxygen generates two hydrogen-bond accepting loci at positions found by
extending a line passing from the oxygen nucleus through each of the two "lone-pair" locations
to where a complementary hydrogen-bond donating atom on the receptor would optimally be.
It is not possible with a bitset representation to attenuate the effects of atoms by the number
of intervening rotatable bonds. Instead, uncertainty about the location of a hydrogen-bonding 
group can be represented by setting additional bits for grid locations spatially adjacent to the 
single grid location that is initially set for each hydrogen-bonding locus. In other words, each 
hydrogen-bonding locus sets bits corresponding to a cube of grid points rather than a single
grid point. The validation results shown in Table 4 were obtained for a cube of 27 grid
locations for each hydrogen bonding locus. The single bitset representing a topomeric
hydrogen-bonding fingerprint has twice as many bits as there are lattice points, in order to
discriminate hydrogen-bond accepting and hydrogen-bond donating loci. The difference
between two topomeric hydrogen-bonding fingerprints is simply their Tanimoto coefficient
which now represents a difference in actual field values. Software which implements the
hydrogen-bonding field calculations is provided in Appendix "B". 
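
A minimal Python sketch of such a fingerprint is shown below. It is not the software of Appendix "B". The names `hbond_fingerprint` and `tanimoto`, the use of integer grid indices for the loci, the bit layout (acceptor bits first, donor bits second), and the clipping of cubes at the lattice boundary are all assumptions made for illustration; mapping real coordinates onto the lattice is left out.

```python
from itertools import product


def hbond_fingerprint(acceptor_loci, donor_loci, grid_shape):
    """Return the set of bit indices that are switched on.

    Loci are (i, j, k) lattice indices. Each locus sets the 3 x 3 x 3 cube of
    lattice points around it (27 bits when the cube lies inside the lattice).
    Bits [0, n) describe acceptor loci and bits [n, 2n) donor loci, where n
    is the number of lattice points.
    """
    nx, ny, nz = grid_shape
    n_points = nx * ny * nz
    bits = set()
    for offset, loci in ((0, acceptor_loci), (n_points, donor_loci)):
        for i, j, k in loci:
            for di, dj, dk in product((-1, 0, 1), repeat=3):
                a, b, c = i + di, j + dj, k + dk
                if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
                    bits.add(offset + (a * ny + b) * nz + c)
    return bits


def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two bitsets: shared bits / bits set in either."""
    union = len(bits_a | bits_b)
    return len(bits_a & bits_b) / union if union else 1.0


grid = (5, 5, 5)
fp_acceptor = hbond_fingerprint([(2, 2, 2)], [], grid)
fp_shifted = hbond_fingerprint([(3, 2, 2)], [], grid)   # one grid step away
fp_donor = hbond_fingerprint([], [(2, 2, 2)], grid)     # same place, donor
```

Because each locus is smeared over a cube, two acceptors one grid step apart still share most of their bits, whereas an acceptor and a donor at the same position share none.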
C. Validation Of Topomeric CoMFA Descriptor

The validity of topomerically aligned CoMFA fields as a molecular structural 
descriptor, which can be used to describe the diversity of compounds, was confirmed on
twenty data sets randomly chosen from the recent biochemical literature. The data sets spanned
several different types of ligand-receptor binding interactions. The only criteria for the data
sets were: 1) the reported biological activities must span at least two orders of magnitude; 2)
the structural variation must be "monovalent" (only one difference per molecule); 3) the
molecules contain no chiral centers; and 4) no page turning was required for data entry in
order to reduce the likelihood of entry errors. Each data set was analyzed independently. The
identification of the data sets is set forth in Appendix "C". The structural variations of the side
chains of the core templates were entered as the Sybyl Line Notations of the corresponding
thiols. (Sybyl Line Notations [SLNs] define molecular structures.) An -SH was substituted for
the larger common template portion of each molecule and provided the two additional atoms
needed for 3D orientation. According to the validation method of this invention the Patterson
plots constructed as discussed above for the twenty data sets are shown in Figures 7(a) - 7(t). 

In 17 of the 20 cases, visual inspection of the plots suggests that the density of points 
in the lower right trapezoid is, indeed, greater than the density in the upper left triangle as 
predicted for a metric descriptor obeying the neighborhood rule. Also, for reasons noted 
earlier, some points do fall above the line as would be expected for the real world. However, 
the relative rarity of points in the upper left triangle of the plots indicates that "small steric 
field differences are not likely to produce large differences in bioactivity", the neighborhood 
rule. Thus, the distribution of points in the Patterson plots across all the randomly selected 
data sets is remarkably consistent with the theoretical prediction for a valid/useful diversity 
metric. It can be easily seen that the topomeric CoMFA metric is validated/useful. 

Table 1 contains the density ratios from the quantitative analysis of the twenty data sets. 
The density ratios of the two test metrics (random number assignments and molecular force 
field energy divided by number of atoms for the diversity descriptor values) described earlier 
are presented for comparison. χ² values reflecting the statistical significance of the ratios are 
also set forth next to the corresponding ratios. 
TABLE 1 
Patterson Plot Ratios and Associated χ² 

No.  Reference          CoMFA    CoMFA     Random 
                        Ratio    χ²        Ratio 
 1   Uehling            1.71      10.27    0.98 
 2   Strupczewski       1.39      57.33    1.01 
 3   Siddiqi            1.44       6.26    0.92 
 4   Garratt-1          1.72      13.01    1.02 
 5   Garratt-2          1.37       8.02    1.04 
 6   Heyl               1.04       0.08    0.99 
 7   Cristalli          1.40      51.21    1.00 
 8   Stevenson          0.95       0.02    0.98 
 9   Doherty            1.63       3.54    1.02 
10   Penning            1.45      10.33    0.99 
11   Lewis              0.95       0.04    1.05 
12   Krystek            1.64     119.92    1.00 
13   Yokoyama-1         1.18       1.88    1.00 
14   Yokoyama-2         1.23       2.62    1.02 
15   Svensson           1.27       3.72    1.04 
16   Tsutsumi           1.38       6.50    0.94 
17   Chang              1.34      45.55    1.01 
18   Rosowsky           1.71      12.46    0.95 
19   Thompson           1.47       3.96    1.06 
20   Depreux            1.22      10.85    0.98 
     MEAN               1.38      18.38    1.00 
     STND. DEVIATION    0.24      29.43    0.04 

Random χ², Energy Ratio, and Energy χ² columns (entries in extraction order; the columns 
are interleaved in the source): 
0.01 0.02 0.01 0.02 0.11 0.01 0.00 0.98 0.97 1.00 0.97 0.00 0.01 0.97 0.96 0.01 
0.05 0.00 0.00 0.02 0.00 0.98 0.96 1.00 0.97 0.97 0.93 0.99 0.99 0.02 0.12 0.10 
0.09 0.07 0.03 0.04 0.96 0.99 1.00 1.00 0.98 0.02 0.02 0.47 0.00 0.07 0.05 0.46 
0.01 0.02 0.00 0.02 0.49 0.41 0.01 0.00 0.06 0.03 0.00 0.00 0.12 0.19 


* Data sets 3 and 20 are not reported for the force field energy because one of the 
structures in each data set (in the topomeric conformation) had a very strained energy 
greater than 10 kcal/mole-atom, which produced a discontinuously large metric 
difference. 

The chi-squared distributions for 1 degree of freedom are: 

95% confidence level: χ² = 3.84 

Typically, a confidence level of 95% is considered appropriate in statistical measures. 

A metric is considered valid/useful for an individual data set if the Patterson plot ratio is 
greater than 1.1; that is, there is greater than a 10% difference in the density between the 
ULT and LRT. The use of 1.1 as a decisional criterion is confirmed by an examination of the 
scatter diagrams of χ² values versus their corresponding ratios as shown in Figures 8A and 
8B. (The value of √χ² is actually plotted in Figure 8B in order to separate the data points.) 
Figure 8A shows the plot of χ²s having a value of greater than 3.84 (95% confidence limit) 
versus their corresponding ratios, while Figure 8B shows the plot of χ²s (plotted as √χ²) 
having a value less than 3.84 versus their corresponding ratios. A ratio value of greater than 
1.1 (Figure 8A) clearly includes most of the statistically significant ratios, while a ratio value 
of less than 1.1 clearly includes most of the statistically insignificant ratios. While this is not 
a perfect dividing point and there is some overlap, there is also some distortion of the χ² 
values due to limited population sizes as discussed below. Overall, the value of 1.1 provides 
a reasonable decision point. 
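
The two decision criteria discussed above (a density ratio greater than 1.1 and a χ² greater than 3.84) can be expressed as a short sketch. The function names and the example values are hypothetical; the only facts relied on are the 1.1 and 3.84 thresholds from the text and the standard identity that, for 1 degree of freedom, the upper-tail probability of χ² equals erfc(√(χ²/2)).

```python
import math

RATIO_CUTOFF = 1.1   # density ratio decision point from the text
CHI2_95 = 3.84       # 95% confidence limit for 1 degree of freedom


def chi2_p_value_1df(chi2):
    """Upper-tail probability of a chi-squared value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


def assess(ratio, chi2):
    """Apply both criteria to one data set (hypothetical helper)."""
    return {
        "useful": ratio > RATIO_CUTOFF,    # LRT denser than ULT by more than 10%
        "significant": chi2 > CHI2_95,     # ratio significant at the 95% level
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from Table 1.
    print(assess(1.5, 12.0))
    print(round(chi2_p_value_1df(3.84), 3))
```

Keeping the two flags separate mirrors the text: a ratio can indicate a useful metric even when a small data set keeps χ² below the 95% limit.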

As noted earlier, the validity of a metric should not be determined on the basis of one 
data set from the literature. A single literature data set usually presents only a limited range 
of structure/activity data and examines only a single biological activity. To obtain a proper 
sense of the overall validity/quality of a metric, its behavior over many data sets representing 
many different biological activities must be considered. It should be expected for randomly 
selected data sets that, due to biological variability, an otherwise valid metric may appear 
invalid for some particular set. An examination of the data in Table 1 confirms this 
observation. 

Except for data sets 6, 8, and 11, the ratios in Table 1 clearly confirm for the 
topomeric CoMFA metric that the density of points in the LRT is greater than in the ULT, and 
the χ² values confirm the significance of the plots. At the same time, the data for the two test 
metrics clearly demonstrate with great sensitivity that this validation technique yields exactly 
the results expected for a meaningless metric; specifically, a density ratio substantially equal 
to 1 and no significance as determined by the χ² test. Contrary to accepted notions in the prior 
art, with the discovery of this invention, random literature data sets can be used to validate 
metrics. The type of publicly unavailable data set (as will be discussed in relation to the Abbott 
data set below) where the bioactivity or inactivity for each molecule in the set has been 
experimentally verified is not required. 

Sets 6, 8, and 11 are the exceptions which help establish the rule. It is realistic to 
expect that randomly selected data sets would include some where molecular edge (typically 
a collision with receptor atoms) or other distorting effects would be present. For set 6, one 
experimental value was so inconsistent with other reported values that the authors even called 
attention to that fact. In addition to a problematic experimental value, all the structural changes 
are rather small but some of the biological changes are fairly large. Something very unusual 
is clearly happening with this system. For set 8, there is simply not enough data. Only 5 
compounds (10 differences) were included and this proved insufficient to analyze even with 
the sensitivity of the Patterson plot. For data set 11, there were two contributing factors. First, 
the data set was small (only 7 compounds). Second, this set is a good example of an edge 
effect where a methyl group protruding from the molecules interacts with the receptor site in 
a unique manner which dramatically alters the activity. 
Generally, the χ² values support the significance (or lack of significance) of the ratio 
values. However, for data sets 9, 13, 14, and 15 the 95% confidence limit is not met. As with 
all statistical tests, χ² is sensitive to the sample size of the population. For these data sets the 
N was simply too low. This sensitivity is well demonstrated by the difference in χ² for sets 
14 and 20. The ratio values of the two sets are virtually identical, but the χ²s differ 
significantly since set 14 has few points and set 20 many points. Thus, χ² may be used to 
confirm the significance of a ratio value, but, on the other hand, cannot be used to discredit 
a ratio value when too few data points are present. It can be clearly seen that the topomeric 
CoMFA metric appears to define a useful dimensional space (measures chemistry space) better 
for some of the target sets than for others. 
As was discussed above, a metric need not be perfect to be valid. Even using an 
imperfect metric significantly increases the probability that molecules can be properly 
characterized based on structural differences. As the quality of the metric increases, the 
probability increases. Thus, metrics which appear valid by the above analysis with respect to 
only a few test data sets are still useful. Metrics, like topomeric CoMFA, which are valid for 
85% (17/20) of the data sets yield a higher probability that structurally diverse molecules can 
be identified. 

Only with respect to data sets 6, 8, and 11 does the topomeric CoMFA metric not 
appear to provide a useful measure. Considering the fact that some of the data sets have 
limited samples and that a very wide range of biological interactions is represented, it is not 
unexpected that random variations like this will appear. The critically important aspect of this 
analysis is the fact that the metric is valid over a truly diverse range of types of ligand- 
substrate interactions. This strongly confirms its general applicability as a valid measure of 
the diversity of molecules which can be used to select optimally diverse molecules from large 
data sets such as for use in combinatorial screening library design. 

Another important aspect of the invention can be derived from these plots. Upon close 
examination it can be seen that molecules having topomeric CoMFA differences (distances) 
of less than approximately 80 - 100 generally have activities within 2 log units of each other. 
This provides a quantitative definition of the radius of an area encompassing molecules 
possessing similar characteristics (similarly diverse) in topomeric CoMFA metric space - the 
neighborhood radius. Because the topomeric CoMFA metric is a valid molecular structural 
descriptor it is known that molecules with similar structure and activity will cluster in 
topomeric CoMFA space. Topomeric CoMFA distances can, therefore, be used as a 
diversity measure in selecting which molecules of a proposed combinatorial synthesis should 
be retained in the combinatorial screening library in order to have a high probability that most 
of the diversity available in that combinatorial synthesis is represented in the library. Thus, 
for a combinatorial screening library, only one example of a molecular pair having a pairwise 
distance from the other of less than approximately 80 - 100 kcal/mole (belonging to the same 
diversity cluster) would be included. However, every molecule of a pair having a pairwise 
distance greater than approximately 80 - 100 would be included. Of course, the "fineness" of 
the resolution (the radius of the neighborhood in metric space) can be changed by using a 
different activity difference. The Patterson plot permits by direct inspection the determination 
of a neighborhood distance appropriate to any chosen biological activity difference. It is 
suggested, however, that for a reasonable search of chemistry space for biologically significant 
molecules, a difference of 2 log units is appropriate. The exact value chosen can be adjusted to the 
circumstances. Clearly, the opportunity for real world perturbing effects to dominate the 
measure is magnified by using less than 2 log units difference in biological activity. This is 
another example of the general signal to noise ratio problem often encountered in 
measurements of biological systems. For more accurate signal detection less perturbed by 
unusual effects, the data sets would ideally contain biological activity values spread over a 
wider range than what is usually encountered. The neighborhood radius predicted from an 
analysis of the topomeric CoMFA metric can now be used to cluster molecules for use in 
selecting those of similar structure and activity (such as is desired in designing a combinatorial 
screening library of optimal diversity). 
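
The inclusion rule described above (keep only one molecule of any pair closer than the neighborhood radius, keep both when they are farther apart) can be illustrated with a simple greedy pass. This greedy procedure is an assumption chosen for brevity and is not the clustering method of the disclosure; the molecule names, the one-dimensional stand-in for field values, and the radius of 90 (a value inside the 80 - 100 range) are all hypothetical.

```python
def select_diverse(names, distance, radius=90.0):
    """Greedy sketch: keep a molecule only if it lies at least `radius`
    away from every molecule already kept."""
    kept = []
    for name in names:
        if all(distance(name, other) >= radius for other in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical one-dimensional stand-in for topomeric field values.
    field = {"a": 0.0, "b": 50.0, "c": 120.0, "d": 130.0, "e": 300.0}

    def dist(m, n):
        return abs(field[m] - field[n])

    print(select_diverse(list(field), dist))
```

The result of a greedy pass depends on the order in which molecules are visited, which is one reason a hierarchical clustering stopped at the neighborhood radius is the more systematic choice.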

The teachings of this disclosure so far may be summarized as follows: 1) a 
generalizable method for validating metric descriptors has been taught; 2) a specific descriptor, 
topomeric CoMFA, has been described; and 3) the topomeric CoMFA descriptor has been 
validated over a diverse sampling of different types of biological interactions from published 
data sets. 

The extraordinary power inherent in the validation method to quantitatively determine 
a significant neighborhood radius is further demonstrated by a remarkable result obtained in 
the analysis of a data set of potential reactants for a combinatorial synthesis (all 736 
commercially available thiols) from the chemical literature. The results were obtained by 
"complete linkage" hierarchical cluster analysis of the resulting steric field matrices using 
"CoMFA_STD" or "NONE" scaling. (CoMFA_STD implies block standardization of each 
field but without rescaling of the individual "columns" corresponding to particular lattice 
points, which here produces the same clusters as no scaling.) For clustering, the "distance" 
between any two molecules is calculated as the root sum of the squared differences in steric 
field values over all of the lattice intersections defined by the CoMFA "region". 
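
The distance definition and the complete linkage procedure described above can be sketched in a few lines. This is a naive illustration under stated assumptions, not the software used for the thiol analysis: the function names are hypothetical, each molecule is represented by a flat list of steric field values, and clustering simply stops when the smallest complete-linkage distance between clusters reaches a chosen stopping distance.

```python
import math


def field_distance(f1, f2):
    """Root sum of squared differences in field values over all lattice points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def complete_linkage(fields, stop_distance):
    """Naive agglomerative clustering with complete linkage.

    Merging stops once the smallest inter-cluster distance is at least
    `stop_distance` (for example, a validated neighborhood radius).
    Returns clusters as lists of indices into `fields`.
    """
    clusters = [[i] for i in range(len(fields))]

    def linkage(c1, c2):
        # Complete linkage: distance between the two farthest members.
        return max(field_distance(fields[i], fields[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= stop_distance:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


if __name__ == "__main__":
    # Hypothetical two-point "fields" for four molecules.
    toy = [[0, 0], [30, 40], [200, 0], [230, 40]]
    print(complete_linkage(toy, stop_distance=91.0))
```

The sketch recomputes every linkage on each pass, which is adequate for a toy example but would be replaced by a dedicated clustering routine for a set of several hundred reagents.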

In this example, cluster analysis using topomeric CoMFA fields produced a 
classification of reagents that makes sense to an experienced medicinal chemist. For example, 
when the topomerically aligned CoMFA fields of the 736 thiols are clustered, and the clustering 
is stopped when the smallest distance between clusters is about 91 kcal/mole (within the 
"neighborhood" distance of 80 - 100 found for these fields in the validation studies), 231 
discrete clusters result, differing from each other in steric size by at least a -CH2- group. Upon 
inspection of the clustering, an experienced analyst will immediately recognize that at this 
clustering level of 231 a natural break occurs, i.e., the separation between cluster level 231 and 
level 232 was greater than any encountered between levels 158 and 682. Further inspection of 
these results showed that, with perhaps ten exceptions, each cluster contained only compounds 
having a very similar 2D topology or connectivity, while different clusters always contained 
compounds having dissimilar 2D topology. Indeed, so logical was the grouping that it was 
possible to provide a characteristic and distinctive systematic name for each of the 238 clusters 
using mostly traditional or 2D chemical nomenclature as shown in Appendix "D". It is striking 
that this entirely automatic clustering procedure, based only on differences among the topomeric 
steric fields of 3D models of single conformers, generates a classification that coincides so 
well with chemical experience as embodied in an independently generated 2D nomenclature. 
From a pragmatic point of view, this result may also be said to validate the validation 
procedure in the eyes of an experienced medicinal chemist who will tend to judge a metric by 
whether its assessment of molecular similarity and diversity agrees with his/her own. 

The critical aspect of this clustering result is that the structurally most logical clustering 
was generated with a nearest neighbor separation of 91, in the middle of the 80 - 100 
neighborhood distance determined from the validation procedure to be a good measure of 
similarity among the molecules in topomeric CoMFA metric space. That is, the neighborhood 
distance of approximately 80 - 100 (corresponding to an approximate 2 log biological 
difference) predicted from the topomeric CoMFA validation generates, when used in a 
clustering analysis, logical systematic groupings of similar chemical structures. The exact size 
of the neighborhood radius useful for clustering analysis will vary depending upon: 1) the log 
range of activity which is to be included; and 2) the metric used since, in the real world, 
different metrics yield different distance values for the same differences in biological activity. 
As seen, the topomeric CoMFA metric can be used to distinguish diverse molecules from one 
another - the very quantitative definition of diversity lacking in the prior art which is necessary 
for the rational construction of an optimally diverse combinatorial screening library. 

The discovered validation method of this invention is not limited to the topomeric 
CoMFA field metric but is generalizable to any metric. Thus, once any metric is constructed 
its validity can be tested by applying the metric to appropriate literature data sets and 
generating the corresponding Patterson plots. If the metric displays the neighborhood behavior 
and is valid/useful according to the analysis of the Patterson plots set forth above, the 
neighborhood radius is easily determined from the Patterson plots once an activity difference 
is selected. This neighborhood radius can then be used to stop a clustering analysis when the 
distance between clusters approaches the neighborhood radius. The resulting clusters are then 
representative of different aspects of molecular diversity with respect to the clustered 
property/metric. It should be noted that a metric, by definition, is only used to describe 
something which has a difference on a measurement scale. This necessarily implies a 
"distance" in some coordinate system. Mathematical transformations of the distances yielded 
by any metric are still "distances" and can be used in the preparation of the Patterson plots. 
For instance, the topomeric CoMFA field distances could be transformed into principal 
component scores and would still represent the same measure. 

Since the validity of the metric is not dependent on the particular chemical/biological 
assays used to establish its validity, the metric can be applied to assemblies of chemical 
compounds of unknown activity. Clustering of these assemblies using the validated 
neighborhood radius for the metric will yield clusters of compounds representative of the 
different aspects of molecular diversity found in the assemblies. (It should be understood that 
active molecules for any given assay may or may not reside in more than one cluster, and the 
cluster(s) containing the active compound(s) in one assay may not include the active 
compound(s) in a different assay.) 

As mentioned above, when designing an efficient combinatorial screening library, one 
wishes to avoid including more than one molecule which is representative of the same 
structural diversity. Therefore, if a single molecule is included from each cluster derived as 
above, a true sample of the diversity represented by all the molecules is achieved without 
overlap. This is what is meant by designing a combinatorial screening library for optimal 
diversity. The methodologies of the present invention for the first time enable the achievement 
of such a design. 

5. Tanimoto Fingerprint Descriptor 

There are other measures of molecular similarity which are not metrics, that is, they 
do not correspond to a distance in some coordinate system but for which differences between 
molecules can be calculated. One such measure is the Tanimoto fingerprint similarity 
measure. This is one of the 2D measurements frequently used in the prior art to cluster 
molecules or to partially construct other molecular descriptors. (Technically, descriptors 
containing a Tanimoto term are not metrics since the Tanimoto is not a metric.) 2D fingerprint 
measures were originally constructed to rapidly screen molecular data bases for molecules 
having similar structural components. For the present purposes, a string of 988 bits has been 
found convenient and sufficiently long. A Tanimoto 2D fingerprint similarity measure (Tanimoto 
coefficient) between two molecules is defined as: 

Tanimoto coefficient = (No. of Bits Occurring in Both Molecules) / (No. of Bits in Either Molecule) 

The Tanimoto fingerprint simply expresses the degree to which the substructures found in both 
compounds are a large fraction of the total substructures. 
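
The definition above can be illustrated with a minimal sketch in which each 988-bit fingerprint is held in a Python integer. The helper names and the example bit positions are hypothetical; which substructure each bit encodes is not specified here and is not assumed.

```python
N_BITS = 988  # fingerprint length stated in the text


def fingerprint(on_bits):
    """Build a fingerprint (as an int) from an iterable of set bit positions."""
    fp = 0
    for b in on_bits:
        if not 0 <= b < N_BITS:
            raise ValueError(f"bit position {b} outside 0..{N_BITS - 1}")
        fp |= 1 << b
    return fp


def tanimoto(fp_a, fp_b):
    """Bits set in both molecules divided by bits set in either molecule."""
    either = bin(fp_a | fp_b).count("1")
    both = bin(fp_a & fp_b).count("1")
    return both / either if either else 1.0


if __name__ == "__main__":
    mol_a = fingerprint([1, 5, 9, 200])
    mol_b = fingerprint([1, 5, 9, 987])
    # Three shared bits out of five distinct bits.
    print(tanimoto(mol_a, mol_b))
```

Returning 1.0 for two empty fingerprints is a convention chosen for the sketch so that identical inputs always compare as fully similar.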
A. Neighborhood Property 
At an American Chemical Society meeting in April, 1995, Brown, Martin, and Bures 
of Abbott Laboratories presented clustering data generated in an attempt to determine which, 
if any, of the common descriptors available in the prior art produced "better clustering". 
"Better clustering" was defined as a greater tendency for active molecules to be found in the 
same cluster. One of the measures used was the Tanimoto 2D fingerprint coefficient calculated 
from the structures of the entire molecules (not just the side chains). Proprietary and publicly 
unavailable data sets were used by the Abbott group which covered a large number of 
compounds for which the activity or lack of activity in four assays had been experimentally 
verified over many years of pharmacological research. Although used as an analytical tool to 
measure clustering effectiveness and not itself a focus of the presentation, one of the graphs 
Martin presented plotted the "proportion of molecular pairs in which the second molecule is 
also active" against the "pairwise Tanimoto similarity between active molecules and all 
molecules" (hereafter referred to as a "sigmoid plot"). From the resulting graph Martin et al. 
essentially found that if the Tanimoto coefficient of molecule A (an active molecule) with 
respect to molecule B is greater than approximately 0.85, then there was a high probability that 
molecule B will also be active; i.e., the activity of molecule B can be usefully predicted by the 
activity of molecule A and vice versa. While not recognized or taught by the Abbott group at 
the time, the present inventors recognized that, for a very restricted data set, the Abbott group 
had data suggesting that the Tanimoto coefficient displayed a neighborhood property. 
B. Applicability Of Tanimoto To Different Biological Systems 
In order to determine whether the Tanimoto coefficient reflects a neighborhood property 
over a range of different biological assays, 11,400 compounds from Index Chemicus containing 
18 activity measures with 10 or more structures were analyzed. (Index Chemicus covers novel 
compounds reported in the literature of 32 journals.) Lack of a reported activity was assumed 
to be an inactivity although, in reality, the absence of a report of activity probably means that 
the compound was just untested in that system. For comparison purposes, this assumption is 
a more difficult test in which to discriminate a trend than with the Abbott data base where it 
was experimentally known whether or not a molecule was active or inactive. However, all that 
is absolutely needed for this analysis is a high likelihood of having compounds that are "similar 
enough" in fingerprints to also be "similar enough" in biological activity. The converse, 
"similar biological activity must have similar fingerprints", is patently untrue and is not tested. 
Table 2 shows the structures and activities analyzed. 
TABLE 2 
Index Chemicus Activities 

Set   No.     Biological             Set   No.     Biological 
No.   Anal.   Activity               No.   Anal.   Activity 
 1            Antianaphylactic       11     18     Cytotoxic 
 2     12     Antiasthmatic          12    133     Enzyme Inhibiting 
 3     71     Antibacterial          13    210     Nematocidal 
 4     16     Anticholinergic        14     12     Opioid Rcptr. Bind. 
 5     55     Antifungal             15     39     Platelet Aggr. Inh. 
 6     17     Anti-inflammatory      16     11     Radioprotective 
 7     21     Antimicrobial          17     13     Renin Inhibiting 
 8     13     β-adrenergic           18     11     Thrombin Inhib. 
 9     21     Bronchodilator 
10     34     Ca Antagonistic 

To convert this data to sigmoid plots, the data lists were examined for everything which 
was active, and a Tanimoto coefficient calculated (on the whole molecule) between every 
active molecule and everything else in the list. For plotting, the value of the number of 
molecules which were a given value (X) away from an active compound was determined. The 
proportion (frequency of such molecules) was plotted on the vertical axis and the Tanimoto 
coefficient on the horizontal axis. The bin widths for the X axis are 0.05 Tanimoto difference 
units wide, and the activity from Index Chemicus was simply "active" or "inactive". Figures 
9A and 9B show the resulting plots for 16 of the 18 data sets broken down into sets of 8 
(replication of these Figures in the priority applications did not pick up the ninth curve in each 
Figure, so that the ninth curve in each set has been omitted from this application). Many of 
the curves have a sigmoid shape, but the inflection points clearly differ. Also, it is not clear 
what effect excluding the differences between active and inactive molecules has on the shape 
of the curves. To get an overall view, Figure 9C shows the cumulative plot for both series of 
9 activities. This plot generally indicates that, given an active molecule, the probability of an 
additional molecule, which falls within a Tanimoto similarity of 0.85 of the active, also being 
active is, itself, approximately 0.85. Stated slightly differently, when a Tanimoto similarity 
descriptor is summed over an arbitrary assortment of molecules and biological activities, it is 
clear that molecules having a Tanimoto similarity of approximately 0.85 are likely to share the 
same activity. Thus, the Tanimoto similarity displays a neighborhood behavior (neighborhood 
distance of approximately 0.15) when applied to a large enough number of arbitrary sets of 
compounds. As will be discussed later, one of the more powerful aspects of the Patterson plot 
validation method is that it can provide a relative ranking of metrics and distinguish on what 
type of data sets each may be more useful. In this regard, it will be seen that the whole 
molecule Tanimoto coefficient as a diversity descriptor has unanticipated and previously 
unknown drawbacks. 
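
The construction of a sigmoid plot described above can be sketched as a binning step. This is an illustrative reading of the procedure, not the code used for Figures 9A - 9C: the function name and the toy similarity values are hypothetical, and pairs are counted from each active molecule to every other molecule, with a bin width of 0.05 as stated in the text.

```python
def sigmoid_plot(similarity, active, bin_width=0.05):
    """For each similarity bin, return the proportion of (active, other)
    pairs in which the other molecule is also active.

    similarity(i, j) -> Tanimoto coefficient between molecules i and j
    active           -> list of booleans, one per molecule
    """
    n_bins = int(round(1.0 / bin_width))
    totals = [0] * n_bins
    hits = [0] * n_bins
    n = len(active)
    for i in range(n):
        if not active[i]:
            continue  # only pairs anchored on an active molecule contribute
        for j in range(n):
            if i == j:
                continue
            k = min(int(similarity(i, j) / bin_width), n_bins - 1)
            totals[k] += 1
            if active[j]:
                hits[k] += 1
    # (lower edge of bin, proportion active); empty bins are omitted
    return [(round(k * bin_width, 10), hits[k] / totals[k])
            for k in range(n_bins) if totals[k]]


if __name__ == "__main__":
    # Hypothetical similarities for three molecules; molecule 2 is inactive.
    sims = {(0, 1): 0.92, (0, 2): 0.31, (1, 2): 0.33}

    def sim(i, j):
        return sims[(min(i, j), max(i, j))]

    print(sigmoid_plot(sim, [True, True, False]))
```

Pairs in which neither molecule is active never enter the counts, which is the loss of information discussed in the comparison with the Patterson plot.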

However, one of the principal features of the present invention, neither taught by the 
Abbott researchers nor recognized by anyone in the prior art, is that the Tanimoto descriptor 
can be used in a unique manner in the construction of a combinatorial screening library. In 
fact, as will be seen, it has been discovered that this descriptor can be used to provide an 
important end-point determination for the construction and merging of such libraries and, in 
addition, is a useful descriptor for constructing and searching the virtual library. 
C. Comparison of Sigmoid and Patterson Plots 

It is important to understand the difference in the types of information about descriptors 
and the neighborhood property which is yielded by the Abbott sigmoid plot and the generalized 
validation method and Patterson plot of the present invention. 

To make a sigmoid plot, the molecules must first be divided into two categories, 
active molecules and inactive molecules, based on a cut off value chosen for the biological 
activity. One molecule of a pair must be active (as defined by the cut off value) before the 
pair is included in the sigmoid plot. Pairs in which neither molecule has any activity, as well 
as those pairs in which neither molecule has an activity greater than the cut off value, do not 
contribute information to the sigmoid plot. Thus, the sigmoid plot does not use all of the 
information about the chemical data set under study. In fact, it uses a limited subset of data 
derivable from the more general Patterson plot described above. As a consequence very large 
sets of data (or sets for which both the activity and inactivity in an assay are experimentally 
known) are needed to get statistically significant results from the sigmoid plots. 

By comparison, the Patterson plot clearly displays a great deal more information 
inherent in the data set which is relevant to evaluating the metric. Most importantly, the 
validity and usefulness of the metric can be quickly established by examining the Patterson 
plots resulting from application of the metric to random data sets. As will be shown in the next 
section, a metric may reflect a neighborhood property (such as in a sigmoid plot), but at the 
same time may not be a particularly valid/useful metric or may have limited utility. In 
Patterson plot analysis, all pairs of molecules and their associated activities or inactivities 
contribute to the validity analysis and to the determinations of the neighborhood radius. Thus, 
in a Patterson plot, it is easy to see what percentage of the total data set is included when the 
neighborhood definition is changed by choosing a different biological difference range. This 
has important consequences for choosing the correct neighborhood radius for clustering. 

To better see the relationship between the information available from each type of plot, 
Figure 10A shows a Patterson plot for the Cristalli data set reconstructed under the Abbott 
sigmoid plot simplification that the 32 molecules were either "active" (activity = 1) or 
"inactive" (activity = 0). The cut off value for biological activity was chosen to be 60 µM. 
Thus, "active" molecules were those with an A1 agonist potency of 60 µM or less, and 
"inactive" molecules were those with a potency greater than 60 µM. With this Abbott 
simplification, only two differences in bioactivities can occur for a pair of molecules: both 
active or inactive, difference = 0; or one active and the other inactive, difference = 1. The 
result of constructing a Patterson plot for this impoverished data set thus must appear as two 
parallel lines, as shown in Figure 10A alongside the Patterson plot for the full Cristalli data 
set in Figure 10B. Although a triangle and trapezoid should still be anticipated within such a 
reduced plot, the active/inactive classification so limits the observable biological differences 
that no pattern whatsoever is apparent. The very limited nature of the information retained is 
clearly seen. In particular, by only looking at molecular pairs in which one molecule is active 
above a predetermined cut off value, the sigmoid plot totally fails to take into account all the 
information about the behavior of the metric with respect to non-active pairs (in which one or 
both molecules have activities less than the cut off value) contained in the distribution of points 
in the Patterson plot. As a major consequence, the Patterson plot is: 1) able to derive 
information from much less data; and 2) much more sensitive to all the nuances contained in 
the data. 
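
The contrast drawn above between the full and the active/inactive Patterson plots can be sketched as follows. The helper names, the toy potencies, and the one-dimensional descriptor are hypothetical; the sketch assumes that each point of a Patterson plot pairs a descriptor distance (horizontal axis) with the absolute difference in activity (vertical axis), and that activity is taken on a log scale for the full plot.

```python
import math
from itertools import combinations


def patterson_points(distance, activity):
    """One (descriptor distance, |activity difference|) point per molecular pair."""
    return [(distance(i, j), abs(activity[i] - activity[j]))
            for i, j in combinations(range(len(activity)), 2)]


def binarize(potency_uM, cutoff_uM=60.0):
    """Active (1) if potency is at or below the cut off, else inactive (0)."""
    return [1 if p <= cutoff_uM else 0 for p in potency_uM]


if __name__ == "__main__":
    potency = [1.0, 50.0, 100.0, 1000.0]      # hypothetical potencies in µM
    descriptor = [0.0, 10.0, 20.0, 30.0]      # hypothetical 1-D descriptor values

    def dist(i, j):
        return abs(descriptor[i] - descriptor[j])

    full = patterson_points(dist, [math.log10(p) for p in potency])
    reduced = patterson_points(dist, binarize(potency))
    print(sorted({round(y, 3) for _, y in full}))   # many distinct differences
    print(sorted({y for _, y in reduced}))          # only 0 and 1
```

With binarized activities every point falls on one of two horizontal lines, which is why no triangle or trapezoid pattern can be discerned in the reduced plot.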

6. Comparison of Tanimoto and Topomeric CoMFA Metrics 

Having recognized that both the topomeric CoMFA and Tanimoto coefficient metrics
display the neighborhood property, a comparison (between Table 1 and columns 3 and 4 of
Table 3) of the application of the two metrics to identical data sets yields interesting insights
into their respective sensitivities. The prior art practice of using the value of (1 - Tanimoto
coefficient) as a distance was followed when performing the analysis. For columns 3 and 4 of
Table 3, Patterson plots were constructed using the Tanimoto distances of the whole molecules
represented in the 20 data sets which had been used for the topomeric CoMFA analysis.
Patterson plots were also constructed using the Tanimoto distances of just the side chains (as
was done with the topomeric CoMFA metric) of the molecules for the same 20 data sets.
Table 3 shows the Tanimoto fingerprint density ratios for the whole molecule and side
chain Tanimoto metrics and the corresponding χ² values for the 20 data sets.
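The (1 - Tanimoto coefficient) distance used in this comparison can be sketched as follows. This is a minimal illustration, assuming each fingerprint is represented as the set of its "on" bit positions; the fingerprints shown are hypothetical, and the convention for two empty fingerprints is an assumption made here.

```python
def tanimoto_coefficient(bits_a, bits_b):
    """Shared on-bits divided by the total number of distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 1.0  # assumed convention: two empty fingerprints are identical
    return len(a & b) / union


def tanimoto_distance(bits_a, bits_b):
    """Distance used for the Patterson plots: 1 - Tanimoto coefficient."""
    return 1.0 - tanimoto_coefficient(bits_a, bits_b)


# Hypothetical fingerprints given as sets of on-bit positions.
fp_a = {1, 4, 7, 9}
fp_b = {1, 4, 8}
distance = tanimoto_distance(fp_a, fp_b)  # 2 shared bits out of 5 distinct bits
```

The same function applies whether the fingerprint was computed for a whole molecule or for a side chain alone; only the input structure differs.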
TABLE 3
Patterson Plot Ratios and Associated χ² Values

                     Col. 1          Col. 2          Col. 3          Col. 4
                     Side Chain      Side Chain      Whole Molecule  Whole Molecule
                     Tanimoto        Tanimoto        Tanimoto        Tanimoto
                     Fingerprint     Fingerprint     Fingerprint     Fingerprint
No.  Reference       Ratio           χ²              Ratio           χ²

1    Uehling         1.89            14.22           1.55            6.22
2    Strupczewski    1.70            143.48          1.41            59.61
3    Siddiqi         1.04            0.08            1.04            0.07
4    Garratt-1       1.60            8.10            1.07            0.19
5    Garratt-2       1.89            36.05           1.08            0.50
6    Heyl            1.71            13.83           1.01            0.00
7    Cristalli       1.75            144.54          1.31            30.27
8    Stevenson       0.94            0.05            1.07            0.04
9    Doherty         1.73            4.03            1.05            0.04
10   Penning         1.97            37.03           1.53            12.73
11   Lewis           1.64            4.80            1.01            0.00
12   Krystek         1.01            0.04            1.23            16.31
13   Yokoyama-1      1.48            9.94            1.01            0.00
14   Yokoyama-2      1.37            18.94           1.70            16.03
15   Svensson        1.64            16.61           1.02            0.02
16   Tsutsumi        1.74            21.56           1.58            14.35
17   Chang           1.34            145.00          1.13            8.36
18   Rosowsky        1.04            [illegible]     1.01            0.00
19   Thompson        1.72            7.83            1.17            0.68
20   Depreux         1.60            64.22           1.18            6.73

     MEAN            1.54            34.62           1.21            8.61
     STANDARD
     DEVIATION       0.32            49.85           0.23            14.57

Surprisingly, the whole molecule Tanimoto appears to be a good descriptor for only 50%
of the data sets (10/20 data sets with a ratio greater than 1.1). At first glance this is
surprising in light of the original Abbott data, but, on second consideration, it is consistent
with the observed significant individual variability of the plots obtained from the Index
Chemicus analysis in Figures 9A and 9B. The Patterson plots confirm that the Tanimoto
coefficient does display a neighborhood property for some data sets, but clearly it is less
valid/useful for other sets. It is also not as consistent as the topomeric CoMFA or the side
chain Tanimoto descriptor, which were valid 85% (17/20) and 80% (16/20) of the time
respectively. Upon inspection of the whole molecule Tanimoto data, it can be seen that the 10
data sets which do not have ratios greater than 1.1 all have a small Tanimoto range and/or
contain relatively few compounds. The χ² values for these data sets also confirm the lack of
statistical significance. Essentially, the whole molecule Tanimoto is a less discriminating
diversity measurement than the others and would appear to need, at the very least, more data
and/or a greater range of values. The method of this invention clearly provides much more
information and insight into the validation of the Tanimoto metric than did the Abbott style
sigmoid plot.

For the majority of sets, 80% (16/20), the side chain Tanimoto metric also appears to
be valid/useful. This is an extraordinarily surprising result since this metric has always been
thought of in the prior art as useful only as a measure of whole molecule similarity. Overall,
it compares favorably with topomeric CoMFA. A very interesting aspect, however, is that the
sets for which validity is not apparent are not identical for the topomeric CoMFA and side
chain Tanimoto metrics. The side chain Tanimoto metric does not appear valid with respect
to sets 3, 8, 12, and 18. Clearly set 8 had too little data for either the topomeric CoMFA or
the side chain Tanimoto descriptors. The most interesting comparison involves sets 3, 12, and
18 which validated the topomeric CoMFA metric but for which the side chain Tanimoto metric
appears invalid. Upon inspection, these sets all contained substituents in which only the
position of a particular side chain varied. Since the topomeric CoMFA metric is sensitive to
the relative spatial orientations of the side chains, while the Tanimoto metric is only sensitive
to the presence or absence of the side chains, the sterically driven topomeric CoMFA metric
was sensitive to the differences in these sets while the Tanimoto was insensitive. In certain
circumstances the Tanimoto may be a useful descriptor of molecular diversity for use on the
reactants in a combinatorial synthesis; a result totally at odds with the wisdom of the prior art.
Clearly, however, the differences in sensitivities between the metrics should be considered
when applying them.
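The counts quoted above can be checked directly against the ratio columns of Table 3. The sketch below uses the 1.1 threshold stated in the text; the helper names are assumptions introduced here, and the ratios are transcribed from Table 3 in data set order (the side chain ratio for set 17 is read as 1.34).

```python
# Patterson plot ratios transcribed from Table 3 (data sets 1 through 20).
side_chain_ratios = [1.89, 1.70, 1.04, 1.60, 1.89, 1.71, 1.75, 0.94, 1.73, 1.97,
                     1.64, 1.01, 1.48, 1.37, 1.64, 1.74, 1.34, 1.04, 1.72, 1.60]
whole_molecule_ratios = [1.55, 1.41, 1.04, 1.07, 1.08, 1.01, 1.31, 1.07, 1.05, 1.53,
                         1.01, 1.23, 1.01, 1.70, 1.02, 1.58, 1.13, 1.01, 1.17, 1.18]


def count_valid(ratios, threshold=1.1):
    """Number of data sets whose ratio exceeds the validity threshold."""
    return sum(1 for r in ratios if r > threshold)


def failing_sets(ratios, threshold=1.1):
    """1-based data set numbers whose ratio does not exceed the threshold."""
    return [i for i, r in enumerate(ratios, start=1) if r <= threshold]


side_chain_valid = count_valid(side_chain_ratios)
whole_molecule_valid = count_valid(whole_molecule_ratios)
side_chain_failures = failing_sets(side_chain_ratios)
```

Run on these columns, the tally reproduces the 16/20 and 10/20 figures and identifies sets 3, 8, 12, and 18 as the ones for which the side chain Tanimoto metric does not appear valid.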

Further, considering the five metrics already discussed above (topomeric CoMFA,
whole molecule Tanimoto, side chain Tanimoto, random numbers, and force field energy) it
is clear that the validation method of this invention can be used to rank the relative quality
(validity/usefulness) of the metrics. In addition, when enough metrics have been examined by
the method of this invention, it will be possible to choose metrics appropriate to the type of
molecular structural differences which it is desired to analyze. Correspondingly, when a
metric which has been validated over a very wide range of data sets and biological activities
yields surprising results (appears invalid) when applied to a new data set, one potential
interpretation may be that the data are in error. This highlights another feature of the
invention, the ability to reliably suggest that some experimental observations are generating
unusual data. Instead of using a data set to validate a metric, the previously validated metric
is used to examine the reliability of the data set. By constructing Patterson plots and checking
the associated χ² value for significance, experimental scientists have another tool with which
they may independently assess their data, especially in situations where new biological
activities are being investigated.
7. Additional Validation Results

Considering that the validation method of this invention has shown that both the
topomeric CoMFA metric and the Tanimoto metric define metric spaces where biological
properties cluster (that is, the metrics are sensitive to biologically relevant molecular structural
differences), a descriptor combining the two metrics was constructed. A combined descriptor
has been identified which is the best diversity descriptor discovered to date. This descriptor
has been validated and has been found to be far superior to any previously considered metric
in its ability to identify a neighborhood of similarity for design purposes. This descriptor, a
weighted combination of the topomeric CoMFA descriptor and the Tanimoto descriptor,
defines a distance measure as:

$$\sqrt{(1 - \text{Tanimoto})^{2} + (0.003 \times \text{topomeric CoMFA})^{2}}$$
This descriptor has a ratio greater than 1.1 in all 20 of the 20 test data sets and, in fact,
averages a ratio of 1.55. In all 20 data sets, for a neighborhood distance of 0.240
(corresponding to a biological activity difference of 2 log units), not one single point was found
above the line in the Patterson plot. Although this may appear to be a "perfect" metric, it is
doubted that this level will be maintained as more and more data sets are added to the
validation group. However, it is believed that it will continue to be the strongest of the
presently known descriptors. At the present time, the results of performing validation studies
on the combined descriptor and other possible metrics using the Patterson plot method of this
invention and the 20 described data sets result in the following data:



TABLE 4 
Patterson Plot Ratios

No.  Reference     HB           LOGP         MR           AP           CONN         AUTO         COMBO

1    Uehling       1.83         1.09         1.07         1.55         1.19         1.66         [illegible]
2    Strupczewski  1.48         1.00         0.99         1.40         1.05         [illegible]  1.47
3    Siddiqi       1.47         0.97         0.92         1.00         1.07         1.00         [illegible]
4    Garratt-1     a            1.01         1.01         0.90         1.11         1.14         [illegible]
5    Garratt-2     a            1.01         1.00         0.97         1.09         1.09         [illegible]
6    Heyl          1.24         0.98         0.95         1.11         b            1.01         1.34
7    Cristalli     1.22         1.06         0.99         1.27         0.98         [illegible]  1.44
8    Stevenson     a            1.03         1.03         1.02         1.02         1.02         [illegible]
9    Doherty       1.07         1.00         1.01         1.18         1.02         1.28         [illegible]
10   Penning       1.72         1.00         0.97         1.05         1.00         1.36         [illegible]
11   Lewis         *0.57        1.00         1.02         0.97         1.15         1.14         [illegible]
12   Krystek       1.69         0.85         0.85         1.43         1.01         1.00         [illegible]
13   Yokoyama-1    *0.71        d            1.01         1.25         1.01         0.99         [illegible]
14   Yokoyama-2    1.00         1.00         0.99         1.25         1.05         0.99         [illegible]
15   Svensson      *0.31        1.01         0.99         1.31         1.08         [illegible]  [illegible]
16   Tsutsumi      1.67         1.04         0.95         1.18         1.00         0.95         1.52
17   Chang         1.35         1.00         1.00         1.00         c            1.20         1.36
18   Rosowsky      1.44         1.03         0.96         1.23         1.08         1.21         1.66
19   Thompson      a            1.12         0.99         0.87         1.02         1.01         1.47
20   Depreux       *0.44        1.02         0.99         0.99         1.01         0.98         1.26

     MEAN          *1.43        1.01         0.98         1.15         1.05         1.12         1.55
     STANDARD
     DEVIATION     *0.27        0.05         0.05         0.19         0.06         0.17         0.16

AP    = Atom Pairs
AUTO  = Autocorrelation
CONN  = Connectivity Indices
HB    = Topomeric Hydrogen Bonding
LOGP  = Calculated Log P
MR    = Molar Refractivity
COMBO = Combined Topomeric CoMFA & Tanimoto

* Asterisked values are excluded in computing the mean. These values are all artifacts, the
result of there being no more than two distinguishable values of the molecular descriptor within
the particular series, hence only two possible values of the x variable in a Patterson plot.
a No hydrogen bonding groups exist to define the metric under HB.
b Too many groups for the software to handle under CONN.
c One hexavalent atom confuses the computation under CONN.
d A LOGP could not be calculated for the molecules in this data set.



Combining the data from Table 4 with the data from Tables 1 and 3 permits the relative
ranking of some known metrics:

VALIDITY/USEFULNESS RANK:                                  No. Of Ratios > 1.1

USEFUL:
    Combined Topomeric Steric CoMFA and Tanimoto           20/20
    Topomeric Steric CoMFA                                 17/20
    Tanimoto 2D Fingerprints (Side Chain)                  16/20
    Topomeric HBond Spatial Fingerprints                   10/12

LESS USEFUL:
    Tanimoto 2D Fingerprints (Whole Molecule)              10/20
    Atom Pairs (R. Sheridan)                               11/20
    Autocorrelation                                        9/20

NOT USEFUL - INVALID:
    Connectivity Indices
      (Health Design Implementation, first 10)             3/18
    Partition Coefficient (CLOGP)                          1/19
    Molar Refractivity (CMR)                               0/20
    Force Field Strain Energy                              0/18
    Random Numbers                                         0/20

Note: A denominator of less than 20 indicates that the metric could not be calculated
for all 20 data sets.
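The combined descriptor that heads this ranking is defined earlier in this section as the square root of (1 - Tanimoto)² plus (0.003 × topomeric CoMFA)². A minimal sketch of that distance follows; the function and parameter names are assumptions introduced here, and the input values in the example are hypothetical.

```python
import math


def combined_distance(tanimoto_coefficient, topomeric_comfa_difference,
                      comfa_weight=0.003):
    """Weighted Euclidean combination of the Tanimoto distance (1 - coefficient)
    and the topomeric CoMFA difference, using the 0.003 weight from the text."""
    tanimoto_term = 1.0 - tanimoto_coefficient
    comfa_term = comfa_weight * topomeric_comfa_difference
    return math.hypot(tanimoto_term, comfa_term)


# Hypothetical pair of molecules: Tanimoto coefficient 0.7, CoMFA difference 40 units.
example = combined_distance(0.7, 40.0)

# Identical fingerprints but a CoMFA difference of 80 units.
comfa_only = combined_distance(1.0, 80.0)
```

Since 0.003 × 80 = 0.240, a pair differing only by 80 topomeric CoMFA units sits exactly at the neighborhood distance of 0.240 quoted for this descriptor.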
Combinatorial Library Design Utilizing Validated Metrics

The starting point for the design of any combinatorial screening library is the choice
of synthetic reaction scheme involving the selection of the core molecule and the possible
reactants which could be used with specific chemistry. As mentioned earlier, well known
and understood organic reactions are generally utilized. Initially, information about the
chemical structure of all the reactants (and cores, when appropriate) and the synthetic
chemistry involved (what products can be built) is input as a database in the computer in a
form recognizable by the computational software. Using the insights gained from the discovery
of the validation method of this invention, it is now possible to design general purpose
combinatorial screening libraries of optimal diversity.

Conceptually, the design process may be thought of as a filtering process in which the
molecules available in a combinatorially accessible chemical universe are run through
consecutive filters which remove different subsets of the universe according to specified
criteria. The goal is to filter out (reduce the numbers of) as many compounds as possible while
still retaining those compounds which are necessary to completely sample the molecular
diversity of the combinatorially accessible universe. The basic design method of this invention
along with several ancillary considerations is shown schematically in Figure 11 using the filter
analogy. For this example only two sets of reactants are considered with one reactant of each
set being contributed to each final product molecule. The reactants are shown forming the top
row and first column of a combinatorial matrix A. Only a portion of the possible combinatorial
matrix is shown, the remainder being indicated by the sections connected to the matrix by dots.
One set of reactants is represented by circles 1, and the other set by squares 2. Each empty
matrix location represents one possible combinatorial product which can be formed from the
two sets of reactants. (The matrix of possible products would be a rectangular prism for three
sets of reactants, and a multidimensional prism for higher orders of reactant sets.) As the
design process is implemented, the number of products to be included in the screening library
design is reduced by each filter 4. Beside each filter step is indicated the corresponding text
section describing that filter. Also set out opposite each filtering step is an indication of the
software and its source required to implement that step.
A. Removal Of Reactants For Non-Diversity Reasons

In designing screening libraries derived from combinatorially accessible chemical
universes, practical and end use considerations as well as diversity concerns can be used to
reduce the number of reactants which will be used to combinatorially specify the product
molecules. These practical and end-use criteria can be divided into those of general
applicability and those of more specific applicability for a particular type of screening library
(such as for drug discovery). The following discussion is not meant to be limiting, but rather
is intended to suggest the types of selections which may be made.
1. General Removal Criteria

As a first consideration, reactants with unusual elements (such as the metals) are
normally excluded when considering the synthesis of organic molecules. In addition,
tautomerization of structures can cause problems when searching a universe of reactants data
base either by missing structures that are actually present or by finding a specific functional
group which is really not there. The most common example of this is the keto-enol
tautomerism. Thus, possible tautomeric reactants must be examined and improper forms
eliminated from consideration. Generally, reactants may be provided in solvent, as salts with
counter-ions, or in hydrated forms. Before their structures can be analyzed for diversity
purposes, the salt counter-ions, solvent, and/or other species (such as water) should be
removed from the molecular structure to be used.

Additionally, reactants may contain chemical groups which would interfere with or
prevent the synthetic reaction in which it is desired to use them. Clearly, either different
reaction conditions must be used or these reactants removed from consideration. Sometimes,
while the synthesis may be possible, extraction of the products resulting from some reactants
may be difficult using the proposed synthetic conditions. Again, if possible, another synthetic
scheme must be used or the reactants removed from consideration. Price and availability are
not insignificant considerations in the real world. Some reactants may need to be specially
synthesized for the combinatorial synthesis or are otherwise very expensive. In the prior art,
expensive reactants would typically be eliminated before proceeding further with the library
design unless they were felt to be particularly advantageous. One of the advantages of the
method of this invention is that the decision whether to include expensive reactants may be
postponed until the molecular structures have been analyzed by a validated descriptor. With
confidence that the validated descriptor permits clustering of molecules representing similar
diversity, often another, less expensive, reactant can be selected to represent the diversity
cluster which also includes the expensive molecule. The specifics of any particular
contemplated combinatorial synthesis may suggest additional appropriate filtering criteria at
this level. In Figure 11 the effect on the number of possible products of removing only a few
reactants is easily seen in matrix B. For each reactant removed, whole rows and columns of
possible products are excluded.
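The effect of removing a reactant on the size of the combinatorial matrix is simple arithmetic, sketched below. The reactant counts are hypothetical and the function name is an assumption introduced here.

```python
import math


def product_count(reactant_set_sizes):
    """Number of combinatorial products: one reactant drawn from each set."""
    return math.prod(reactant_set_sizes)


# Hypothetical two-set synthesis: 100 reactants in one set, 80 in the other.
before = product_count([100, 80])

# Remove 3 reactants from the first set and 2 from the second.
after = product_count([97, 78])
excluded = before - after
```

Removing only five reactants in this example excludes 434 of the 8,000 possible products, because each removed reactant takes an entire row or column of the matrix with it.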

2. Biologically Based Criteria
A library designed for screening potential pharmacological agents imposes its own
limitations on the type and size of molecules. For instance, for drug discovery, toxic or
metabolically hazardous reactants or those containing heavy metals (organometallics) would
usually be excluded at this stage. In addition, the likely bioavailability of any synthetic
compound would be a reasonable selection criterion. Thus, the size of the reactants needs to be
controlled since it is well known that molecules above a given range of molecular weight
generally are not easily absorbed. Accordingly, the molecular weight for each reactant is
calculated. Since the final molecular weight for a bioavailable drug typically ranges from 100
to 750 and since, by definition, at least two reactants are used in a combinatorial synthesis,
reactants having a size over some set value are excluded. Typically, [...]
excluded at this stage at the present time. A lower value could be used, but it is felt that there
is no reason to restrict the diversity unduly at this stage in the design process. Once again, of
course, this value can be adjusted depending on the chemistry involved.

Another aspect of bioavailability is the diffusion rate of a compound across membranes
such as the intestinal wall. Reactants not likely to cross membranes (as determined by a
calculated LogP or other measure) would usually be eliminated. At the present time, although
the CLOGP for reactants makes only a partial contribution to the product CLOGP, it is
believed that if any reactant has a CLOGP greater than 10, it will not make a usable product.
Accordingly, the CLOGP is calculated for each reactant and only those with a CLOGP no
greater than 10 are kept. Again, in any particular case, a different value of CLOGP could be
utilized. For those reactants for which it is difficult or impossible to calculate a LOGP, it is
assumed the CLOGP would be less than 10 so that the reactants are kept in the library design
at this point. As will be discussed later, a CLOGP will also be calculated on the products.
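The reactant-level molecular weight and CLOGP screens might be sketched as follows. The `Reactant` record and the function name are hypothetical. The weight cut off is a required argument because the text treats it as adjustable, and the value of 400 used in the example is an arbitrary placeholder rather than a value from the text. Reactants whose CLOGP could not be calculated are kept, as described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reactant:
    name: str
    molecular_weight: float
    clogp: Optional[float]  # None when a CLOGP could not be calculated


def keep_reactant(reactant, max_weight, max_clogp=10.0):
    """Apply the size screen, then the CLOGP screen (missing CLOGP passes)."""
    if reactant.molecular_weight > max_weight:
        return False
    if reactant.clogp is None:
        return True
    return reactant.clogp <= max_clogp


# Hypothetical reactants for illustration only.
reactants = [
    Reactant("r1", 150.0, 2.1),
    Reactant("r2", 820.0, 1.0),   # too heavy
    Reactant("r3", 210.0, 11.5),  # CLOGP above 10
    Reactant("r4", 180.0, None),  # CLOGP not calculable, kept
]
kept = [r.name for r in reactants if keep_reactant(r, max_weight=400.0)]
```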

Other reactants are considered undesirable due to the presence of structural groups not
considered "bio-relevant". Bio-relevance is judged by comparison with known drugs and by
the experience of medicinal chemists involved in the design of the library. It is hoped that a
future formal analysis of drug databases will yield further information about which groups
should be excluded. Exclusion on this basis should be minimized since one of the goals of the
combinatorial library design process is to find biologically active molecules through the
exploration of combinatorial chemistry space which might not otherwise be found. Other
removal criteria may be based on whether possible reactants involved sugars or had multiple
functionalities. At the present time, the compounds shown in Table 5 are believed to be
undesirable and are generally excluded at the initial stage of library design.

TABLE 5 
Biologically Non-Relevant Groups




Phosphorylating 
agents 
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Het( = 0)(= 0))Lvg{Lvg: OHev Hal} 
Cll]HetC®l 



Stability, alkylating
agent 




Other Triaryls 



Alpha-dicarbonyls 



Any:Any-[!r]Any(-l!r]Any:Any)\ 
(-[!rlAny:Any)Lvg{Lvg:HetiHal} 



Oom=[!r]Any(AnyHevVC = [!f10om{Ooni-.0!N} 



The choice of whether to eliminate some reactants based on such general and specific
considerations will vary with the given situation. Except in the case of [...], it is
recognized that any other limiting selection decreases the diversity of the combinatorial library
and [...] eliminates active [...]. As always, when eliminating reactants at the very
beginning of library design, the problem boils down to a question of probabilities: what is the
chance of missing a significant lead molecule in the real world [...]
least is a high probability that it is unlikely that such a molecule will be missed [...] the selection
criteria under consideration are implemented. The application of many of these selection
criteria (price, availability, toxicity, bioavailability, diffusion, and non-bio-relevant
structural groups) can occur before, during, or after the screening library has been selected
based on other criteria. Clearly, however, the earlier these selection criteria are applied, the
greater will be the reduction in the number of combinatorial possibilities which will need to be
[...] process. As will be discussed below, not only are these criteria
[...]
the library design process is indicated in Figure 11 at matrix C.

[...] screening library will: 1) have molecules
representing the entire range of diversity present in the chemical universe accessible with a
given set of combinatorial materials; and 2) will not have two examples [...]
when one will suffice. The goal is to obtain as complete a sampling of the diversity of [...].
There are two opportunities based on diversity considerations to reduce the number of
[...]. The second opportunity occurs after all the combinatorial possibilities from the
reactants (and cores) have been selected. The method of the present invention utilizes
both opportunities by using validated metrics [...].

Any metric which has been shown by the Patterson plot validation [...] to be
valid/useful when applied to reactants can be used at this stage of the library design process.
However, there are a number of reasons to use a metric which reflects [...]
combinatorially accessible chemical universe. The principal reason is that [...]
observation of biological systems is that ligand-substrate binding is primarily governed by three
dimensional considerations. Before a reactive side group can get to the active site, before
appropriate electrostatic interactions can occur, before appropriate [...]
formed, and before hydrophobic effects can come into play, the ligand molecule must basically
fit the three dimensional site of the substrate. Thus a principal consideration in
designing screening libraries should be to sample as much of the three dimensional (steric)
diversity of the combinatorial universe as is possible. The initial method of the present
invention does this by utilizing the validated topomeric CoMFA metric to analyze the steric
properties of the proposed reactants.

A second reason for applying a steric metric to the reactants is that all of the three
dimensional variability of the products resulting from a combinatorial synthesis resides in the
substituents added by the reactants since the core three dimensional structure is common to all
molecules in any particular combinatorial synthesis. In a sense it would be redundant to
measure the contribution to each product molecule of a core which is common to all the
products. A third reason for applying a three dimensional metric to the reactants is that a
sterically sensitive metric distinguishes differences among molecules that are not revealed using
other presently known metrics. For instance, the topomeric CoMFA metric is more sensitive
to the volume and shape of the space occupied by a molecule than is, for instance, either the
side chain or whole molecule Tanimoto descriptor. Figure 12 provides an illustrative example
of this feature drawn from the thiol study which confirms what was seen in the Patterson plots
of the topomeric CoMFA and Tanimoto whole molecule descriptor. Figure 12 shows three
clusters labeled 24, 25, and 29 for which the Tanimoto whole molecule fingerprint metric does
not indicate any substantial difference in molecular structure among the molecules, labeled (a)
through (f), making up each of the clusters. The large panel A in the upper right of Figure 12
shows orthogonal 3D views of the volume differences within clusters 24, 25, and 29
comparing each of the molecules that are not in the majority steric field cluster. For example,
the Cluster 24 figure B at the top shows four contours (yellow, green [hidden], red and blue)
indicating the differences in volumes occupied by compounds 24(a), 24(b), 24(c) and 24(f)
compared to compounds 24(d) and 24(e) which are found in the same steric field cluster,
number 10. The middle C and bottom D figures in the large panel A show similar
distinguishable volume differences for Clusters 25 and 29. While the whole molecule Tanimoto
metric does not distinguish much difference between the molecules within each of these
clusters, it is readily apparent from Figure 12, even to an untrained eye, that the molecules
in the clusters represent very different types of structural diversity; that is, significantly
different three dimensional volumes are occupied by the molecules within each whole molecule
Tanimoto determined cluster. The topomeric CoMFA metric clearly shows steric differences
that are not indicated by the 2D Tanimoto. As seen earlier, a side chain Tanimoto similarity
descriptor also does not distinguish steric differences amongst some molecules. A metric
responsive to steric differences is, therefore, clearly preferred as a diversity discriminator for
[...]

The initial method for selecting reactants based on diversity is shown schematically at
the third filter in Figure 11. A diversity selection based on three dimensional steric measures
begins by: 1) generating 3D structures for the reactants; 2) aligning the 3D molecular
structures according to the topomeric alignment rules; 3) generating CoMFA steric field values
for the reactants including, if desired, hydrogen bonding fields, and applying a rotatable bond
attenuation factor; and 4) calculating pairwise topomeric CoMFA differences for every pair
of reactants. At this point the steric diversity of the reactant space has been mapped into the
topomeric CoMFA metric space. From the validation of the topomeric CoMFA metric it was
found that the neighborhood radius for an apparent activity difference of 2 log units was
defined by a distance of approximately 80 - 100 topomeric CoMFA units (kcal/mole).
Therefore, at this point, the method of the invention clusters (using hierarchical clustering) the
reactants in topomeric CoMFA space so that reactants having a pairwise difference of less than
approximately 80 - 100 units are assigned to the same cluster. Put another way, clustering is
continued until the inter-cluster separation is greater than approximately 80 - 100 units. As
desired, there is some leeway in choosing the exact neighborhood radius in and around the
neighborhood range to use for any given biological system. (An experienced practitioner of the
clustering art will easily be able to determine, by noting the natural breaks in the clustering,
where about the 80 - 100 range best clustering is obtained.) This process will produce clusters
having reactants whose product activities will only rarely differ by more than approximately
2 log units. If reactant clusters having product activities differing by a greater or lesser
amount are desired, the neighborhood distance used may be increased or decreased
accordingly. The effect on the neighborhood distance of choosing such other activity range can
be seen by viewing the Patterson validating plots for the topomeric CoMFA descriptor.
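The clustering step can be sketched as a small agglomerative procedure that keeps merging the two closest clusters until the closest remaining pair is farther apart than the chosen neighborhood radius. This is only an illustration: the text specifies hierarchical clustering but not the linkage rule, so the complete-linkage rule used here is an assumption, as are the function name and the hypothetical distance matrix.

```python
def cluster_by_radius(distances, radius):
    """Agglomerative clustering on a symmetric pairwise distance matrix.

    Assumes complete linkage: the separation of two clusters is the largest
    pairwise distance between their members. Merging stops once the closest
    pair of clusters is separated by more than `radius`.
    """
    clusters = [[i] for i in range(len(distances))]

    def separation(c1, c2):
        return max(distances[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = separation(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > radius:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(clusters)


# Hypothetical pairwise topomeric CoMFA differences for five reactants.
D = [
    [0, 30, 60, 200, 220],
    [30, 0, 50, 210, 230],
    [60, 50, 0, 190, 205],
    [200, 210, 190, 0, 40],
    [220, 230, 205, 40, 0],
]
reactant_clusters = cluster_by_radius(D, radius=90)
```

With a radius of 90 units the five hypothetical reactants fall into two clusters; raising or lowering the radius changes the activity range each cluster is expected to span, as the paragraph above notes.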

The clustering process now identifies groups (clusters) of reactants having steric
diversity from one another but also having the same steric properties within each cluster. Or,
put in terms familiar to medicinal chemists, the molecules of each cluster should be bioisosteres.
For purposes of designing a combinatorial screening library which has within it molecules
representing the full range of steric diversity present in the universe of reactants, it is now only
necessary to select one reactant from each cluster for inclusion in the library. A reasonable
way to select the one reactant from each cluster would be to select the lowest priced or most
readily available one. However, additional criteria may be considered. The diverse reactants
remaining at matrix D need not be adjacent to each other on the combinatorial matrix and are
only shown this way for graphic convenience. At this point the first stage of library design has
been completed.
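Choosing one representative per cluster by price might look like the following. The reactant names, prices, and function name are hypothetical; availability or other criteria could be substituted by changing the key used for the comparison.

```python
def pick_representatives(clusters, price):
    """Return the lowest priced reactant from each cluster."""
    return [min(cluster, key=lambda name: price[name]) for cluster in clusters]


# Hypothetical clusters and prices, for illustration only.
clusters = [["amine_A", "amine_B", "amine_C"], ["amine_D", "amine_E"]]
price = {"amine_A": 12.0, "amine_B": 3.5, "amine_C": 40.0,
         "amine_D": 9.0, "amine_E": 15.0}
representatives = pick_representatives(clusters, price)
```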

While the use of a topomeric CoMFA metric to measure the three dimensional
structural diversity of the reactants has been discussed, it should be apparent that any metric:
1) reflective of the three dimensional properties of molecules; and 2) validated as taught above
could be applied to the reactants to be used in a combinatorial synthesis in the manner taught
above. The teaching of this invention is not limited to the use of the topomeric CoMFA metric,
but also includes the use on reactants of all validated three dimensional metrics. As seen
earlier, at the present time initial studies of topomeric hydrogen bonding fields indicate that
it should be a very useful metric. For those reactants expected to form large numbers of
hydrogen bonds, this may be the metric of choice. The hydrogen bonding metric would be
used as an adjunct to the topomeric CoMFA metric in those situations. There may be situations
where a sterically sensitive metric is not needed, in which case it should be clear that any valid
metric appropriate to reactants could be used.

C. Identification (Building) Of Products
Once the set of diverse reactants has been identified by the above method, the structures
of the product molecules can be combinatorially determined based on the synthetic reaction
scheme and any desired cores. The reactants are used to build the structures of the
combinatorial products using LEGION and are stored in molecular spread sheets. In matrix E
the products which can still be built from the available reactants are shown as asterisks in each
matrix location.
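The combinatorial enumeration itself can be sketched with the standard library. This is not LEGION and builds no chemical structures; it only lists which reactant combinations define each product, using hypothetical identifiers and an assumed function name.

```python
from itertools import product


def enumerate_products(core, *reactant_sets):
    """List every (core, reactant, reactant, ...) combination, one per product."""
    return [(core,) + combo for combo in product(*reactant_sets)]


# Hypothetical core and two sets of surviving diverse reactants.
set_1 = ["R1a", "R1b"]
set_2 = ["R2a", "R2b", "R2c"]
products = enumerate_products("core", set_1, set_2)
```

Each tuple corresponds to one location in the combinatorial matrix; with three or more reactant sets the same call enumerates the higher dimensional prism described earlier.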

D. Removal Of Products For Non-Diversity Reasons

After the possible product structures have been identified, another opportunity exists 
to reduce the number of products due to general non-diversity considerations. These 
considerations will generally be related to the particular chemistry involved and might relate 
to product instabilities, cyclic structures, etc. (Matrix F)

During the building of the combinatorial product molecules, the size of the product
molecules increases and various combinations of core and substituents will affect the likely
diffusion of the molecule (and may even form one of the biologically undesirable molecular
groupings). Thus, in order to eliminate molecules which would not be used as drugs, the
product molecules should be examined with many of the same selection criteria applied to
reactants. In particular, molecular weights should be calculated and those compounds which
have molecular weights over a predetermined value should be rejected. Typically, a value of
750 is used at this time as a representative weight above which bioavailability may become a
problem. In addition, CLOGP should be calculated and any proposed molecule with a value
under -2.5 or over 7.5 rejected. The number of structures eliminated at this point will depend
in part both on the chemistry involved and the molecular weight range retained at the reactant
stage. These additional product structures which are eliminated are reflected in matrix G.
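
The product-level screen just described may be sketched as follows. This is only a minimal
illustration under stated assumptions: the Product record, its field names, and the sample
values are hypothetical, and the molecular weight and CLOGP of each product are assumed to
have been calculated elsewhere. Only the thresholds (750, -2.5 and 7.5) come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text; the constant names are illustrative only.
MAX_MOLECULAR_WEIGHT = 750.0
MIN_CLOGP, MAX_CLOGP = -2.5, 7.5


@dataclass(frozen=True)
class Product:
    """Hypothetical record for one built product structure."""
    name: str
    molecular_weight: float
    clogp: float


def passes_product_filter(p: Product) -> bool:
    """Reject products over the weight limit or outside the CLOGP range."""
    if p.molecular_weight > MAX_MOLECULAR_WEIGHT:
        return False
    return MIN_CLOGP <= p.clogp <= MAX_CLOGP


# Made-up sample values, used only to exercise the filter.
candidates = [
    Product("p1", 412.5, 3.1),   # retained
    Product("p2", 801.0, 4.0),   # rejected: molecular weight over 750
    Product("p3", 350.2, 8.2),   # rejected: CLOGP over 7.5
    Product("p4", 298.7, -3.0),  # rejected: CLOGP under -2.5
]
retained = [p for p in candidates if passes_product_filter(p)]
```

Values exactly at a limit are retained here, since the text rejects only values over 750, under
-2.5, or over 7.5.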

E. Removal Of Non-Diverse Products

As noted, a second opportunity based on diversity considerations to reduce the number
of molecules to be included in the combinatorial screening library occurs after the products of
a proposed combinatorial synthesis have been "built" by the software in the computer. Such
an additional reduction is usually necessary since the number of combinatorial products at this
stage may still be astronomically large. This is reflected in matrix G. In addition, it makes no
sense to screen any more molecules than is absolutely necessary, and redundancy may occur
in the products for several reasons. In a simple case, if two diverse reactants may react
independently at each of two possible sites on a symmetric core molecule, two identical
product molecules will be generated. In a more complex case, it is possible that one
combination of core and reactants is similar (due to the similarities of structures contained in
the core to the structure of the reactants) to another combination of core and reactants. That
is, when the reactants are combined with the core molecule, it is possible that substructures
within the core can combine with different substituents to form similar structures. Clearly, it
would be redundant to screen both. How to select product molecules has been a vexing
problem in the prior art, and this is one reason why the prior art has basically been concerned
with clustering criteria. The general approach taken in the prior art to avoid oversampling
combinatorial product molecules representing the same diversity has been to cluster the
molecules and then maximize the distance between clusters with whatever metric was applied
to the products.

Based upon an understanding developed from the theoretical considerations of validating
a metric outlined above, the library design method of this invention again makes use of the
neighborhood principle to solve this problem. However, it is important to understand that,
unlike some methods of the prior art, the method of this invention specifically does not use a
metric to cluster product molecules. Rather, the neighborhood definition may be used to decide
which product molecules to retain in the final screening library and, correspondingly, when
the appropriate number of product molecules have been selected for inclusion in the library.
Essentially, starting with one product molecule, additional molecules are selected as far apart
as possible (in the validated metric space) from any molecule already in the library until the
next molecule to be selected would fall within the neighborhood distance of a molecule already
included. Additional molecules are not included because to do so would include two or more
molecules within the library representing the same structural diversity. Therefore, the
neighborhood principle is used as a sampling rule to ensure that molecules representative of
the same diversity or otherwise too similar are not included in the library. The resulting
combinatorial screening library is not redundant and has not oversampled the diversity space.

In the present invention, the Tanimoto 2D whole molecule similarity coefficient is used
for the final product selection. As was seen above, this metric possesses the neighborhood
property. Accordingly, from the combinatorial products either a first product is arbitrarily
chosen for inclusion in the library or an initial seed of one or more products may be specified.
If an arbitrary product molecule is chosen, Tanimoto coefficients are calculated for all other
molecules to the first molecule and a second molecule with the smallest Tanimoto coefficient
(greatest distance - least similarity) from the first is chosen for inclusion. For the efficient
selection of additional molecules to be included, the distance (1 - Tan. Coeff.) between each
additional molecule and all molecules already included in the library is calculated. For each
additional molecule, the distance to the closest molecule already in the library is identified.
These closest distances for each additional molecule are compared, and the additional molecule
whose closest distance is the greatest is selected next for inclusion; that is, the molecule which
is farthest away from the closest molecule in the library is selected. A new set of distances is
calculated and the process continued, selecting one molecule at a time, until no more molecules
remain which are farther away than 0.15 (1 - 0.85; the definition of a Tanimoto distance
using the neighborhood value of 0.85). While this example is presented in terms of the
Tanimoto similarity coefficient, any validated whole molecule metric and its neighborhood
definition may be used with this sampling procedure.
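
The sampling procedure described above may be sketched as follows. This is a minimal
illustration and not the software of this disclosure: each molecule is assumed to be
represented by a set of structural features (a hypothetical stand-in for a 2D fingerprint),
and the function names, the toy feature sets, and the choice of the first molecule as the seed
are all assumptions made for the example.

```python
from typing import Callable, List, Sequence


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two feature sets (1.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Distance as used in the text: 1 - Tanimoto coefficient."""
    return 1.0 - tanimoto(a, b)


def select_diverse(
    molecules: Sequence,
    distance: Callable,
    neighborhood: float = 0.15,
    seed_index: int = 0,
) -> List[int]:
    """Return indices picked by the 'farthest from its closest neighbor' rule."""
    if not molecules:
        return []
    selected = [seed_index]
    # closest[i] is the distance from molecule i to its nearest selected molecule.
    closest = [distance(m, molecules[seed_index]) for m in molecules]
    while True:
        best = max(range(len(molecules)), key=closest.__getitem__)
        if closest[best] <= neighborhood:
            break  # every remaining molecule lies inside some neighborhood
        selected.append(best)
        closest = [
            min(c, distance(m, molecules[best]))
            for c, m in zip(closest, molecules)
        ]
    return selected


# Toy feature sets (hypothetical): molecule 1 is a near duplicate of molecule 0.
fingerprints = [
    frozenset(range(0, 20)),
    frozenset(range(0, 19)),
    frozenset(range(10, 30)),
    frozenset(range(100, 120)),
]
library = select_diverse(fingerprints, tanimoto_distance)
```

In this toy case molecule 1 has a Tanimoto coefficient of 0.95 to molecule 0 and is therefore
never selected, while molecules 3 and 2 are added in that order.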

As noted earlier, the value of 0.85 for the Tanimoto neighborhood definition originally
appeared in the sigmoid plots. To confirm whether this is the correct neighborhood definition
for the Tanimoto metric, the Patterson plots for the whole molecule Tanimoto in which the χ²
indicated significance were used to calculate the neighborhood value. The metric distances
corresponding to 2-log and 3-log biological differences were determined by dividing the slope
of the density determined line by the values 2 and 3 respectively. Over the data sets, the
average metric distance for a 2-log biological difference was 0.14 and the average metric
distance for a 3-log biological difference was 0.21. Since the Tanimoto distance of (1 - Tan.
coeff.) is plotted in the Patterson plot, these values correspond to a 2-log similarity of 0.86
and a 3-log similarity of 0.79. This confirms the reasonableness of using 0.85 in the sampling
process. Also, as discussed earlier, it is reasonable to have more confidence in the definition
of the neighborhood derived from the Patterson plots which utilize all the molecular data. As
noted with reference to selection of a neighborhood distance using the topomeric CoMFA
metric on reactants, there may be a situation where a different biological activity may be
appropriate and a correspondingly different neighborhood distance used for product selection.

Conceptually this selection process is reflected in Figure 13. Figure 13 shows a plot
of the Tanimoto 2D pairwise similarities for a typical combinatorial product universe in which
there has been some selection of reactants based on diversity. As can be seen, a very large
percentage of the products have similar structures (Tanimoto coefficients > 0.85). The
sampling process outlined above results in the following. Molecules having pairwise
similarities above approximately 0.85 have overlapping neighborhood radii as shown at 1 and
one of each pair is excluded from the library. Molecules having pairwise similarities of
approximately 0.85 have almost touching but not overlapping neighborhood radii as shown at
2 and are included in the library. Molecules having pairwise similarities significantly less than
approximately 0.85 have no overlapping neighborhood radii as shown at 3 and are also
included in the library. Excluding molecules with a Tanimoto similarity greater than 0.85 will
eliminate a significant number of molecules in this representative product assembly. This
reduction is also reflected in matrix H. While the circles of similarity shown in Figure 13
represent convenient conceptualizations of the neighborhood distance concept, it should be
remembered that most metrics will not define a space in which the "distance" corresponds to
an area or volume. In particular, a Tanimoto similarity space does not have this property, yet
the "similarity" to a neighbor can be defined and is very useful.

A specific example illustrates the dramatic power of the final selection stage in the
design process. A proposed combinatorial screening library was designed using thiols and
sulfonyl chlorides as reactants. (Many of the same thiols were considered in the study
discussed earlier.) The original 716 thiols and 223 sulfonyl chlorides considered would make
159,668 potential products. Topomeric CoMFA analysis indicated that 170 thiols and 61
sulfonyl chloride reactants represented diverse molecules for the purposes of this design and
should be used in further library design. 10,370 combinatorial products were now possible.
Graph 1 of Figure 14 shows the Tanimoto similarity distribution of the 10,370 possible
products. It can be seen that a large percentage of the possible products were at least 0.85
similar to each other. Following the final stage selection process of the method of this
invention, 1,656 product molecules were selected, none of which was 0.85 similar to any other.
Graph 2 of Figure 14 shows the plot of the Tanimoto similarities of the final library design
products. (The Y axis of the graph is plotted in fraction per % so that the integrated totals are
proportional to 10,370 and 1,656 respectively.) The remarkable selectivity of the sampling
process is immediately apparent. The products of the designed library have a clearly different
similarity profile than the non-selected products. In addition, there has been a greater than
[illegible] reduction in the number of product compounds. Thus, from a possible universe of
159,668 potential combinatorial products, 1,656 have been identified which represent the
structural diversity of the large ensemble. An approximate 100:1 reduction has been achieved
without sacrificing the diversity of the combinatorially accessible universe. As a result of the
library design, only the 1,656 compounds have to be synthesized. In addition, these same 1,656
compounds can be tested in any number of biological assays with a high degree of assurance
that even in assays with unknown biological activity requirements, these compounds will
present the diversity of compounds accessible through this combinatorial universe to the
biological assays. Thus there is not only a savings in time and expense in the synthesis and
testing of the identified molecules in the library, but it is not necessary to change library
design (with concomitant time and expense) each time it is desired to screen a different
biological assay. Over time, using the library design of this invention and the process for
merging libraries discussed below, it will be possible to build up an optimally diverse
combinatorial screening library based on many different combinatorially accessible universes,
and this combined library will represent the first real general purpose screening library
available to the art - a realization of a long sought after, and previously believed unattainable,
goal.
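
The counts quoted in the thiol and sulfonyl chloride example can be checked with a few lines
of arithmetic. The sketch below uses only the numbers stated in the text; the variable names
are illustrative, and the two ratios are computed here rather than quoted from the source.

```python
# Counts quoted in the text for the thiol / sulfonyl chloride example.
thiols_all, sulfonyl_chlorides_all = 716, 223
thiols_diverse, sulfonyl_chlorides_diverse = 170, 61
final_library = 1_656

# Every thiol can be paired with every sulfonyl chloride.
potential_products = thiols_all * sulfonyl_chlorides_all
buildable_products = thiols_diverse * sulfonyl_chlorides_diverse

# Ratios derived from the counts above.
overall_reduction = potential_products / final_library
final_stage_reduction = buildable_products / final_library

print(f"{potential_products:,} -> {buildable_products:,} -> {final_library:,}")
print(f"overall {overall_reduction:.1f}:1, final stage {final_stage_reduction:.1f}:1")
```

The overall ratio comes out near 96:1, which is consistent with the approximate 100:1
reduction stated in the text.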

Clearly other validated whole molecule metrics and their associated neighborhood
distances can be used with the sampling process described above to select product molecules
for inclusion in a screening library. However, it makes no sense to use the same metric for
the products as was used for the reactants. For instance, in the case of the topomeric CoMFA
metric no information would be gained if the metric was used again with the products since
all the steric information from the reactants has been transferred to the products. What is
critical is that the combinatorial screening library should be constructed by including product
molecules which do not fall within the neighborhood radius of other molecules and excluding
product molecules which fall within the neighborhood radius of previously chosen molecules.
At the end of the design process of this invention, a list of product structures and the reactant
sources for each is available in the computer and can be output either in electronically readable
or visually discernible form. This data defines the combinatorial screening library. The list
of reactants is supplied to synthetic organic chemists. Actual synthesized molecules are then
available for testing in the biological assays, typically on multiple well plates. The list of
products from each library design can be used to create a definition of a larger combinatorial
screening library when merged with other such libraries as discussed below.

The combinatorial screening library designed by the method of this invention is both
locally diverse (no two reactants representing the same steric space are present) and globally
diverse (no two products having overall similar structures are present). Such a library thus
meets the desired combinatorial screening library criteria of being representative of the
diversity of the entire combinatorially accessible chemistry universe while at the same time not
containing more than one sample of each diversity present (no oversampling). An optimally
diverse combinatorial screening library has thus been achieved. By designing an optimally
diverse screening library, a reduction in the number of combinatorially generated structures
which need to be synthesized and tested of substantially greater than 10^ - 10' should be 
possible. 

9. Lead Compound Optimization

Unless an entire combinatorially accessible chemical universe is screened, a lead
molecule found from screening a library will rarely be the most active or the optimal molecule
desired. Therefore, extensive additional work is usually required searching for a related
compound possessing the greatest activity or some combination of activity and another
desirable feature such as bioavailability. Most of the time, the design of the screening library
from which the compound was identified provides little, if any, help in this search. Again,
medicinal chemists must resort to traditional methods of lead development. Combinatorial
screening libraries based on the methods of this invention provide the means for a directed
search of the chemistry space in a way not possible with prior art libraries.

This feature results directly from the fact that the libraries are constructed at each level
by selecting molecules which are representative samples of particular molecular diversities.
Thus, once a lead is identified, it is a straightforward matter to identify and test compounds
representative of the same and/or closely related diversity; i.e., it is known how to identify
molecules within the neighborhood of the active lead, as defined by the validated metrics used
to construct the screening library. Furthermore, the synthetic chemical methods used to
construct the screening library are already known and tested and can be used to synthesize
additional molecules of the same or similar molecular structural diversity. Since time is always
of the essence, especially in exploring a newly discovered biological target, a rational follow
up search through an optimally designed library of this invention permits homing in on crucial
molecular structures directly and quickly. Not only does this procedure speed up the
development process, but it also avoids wasting the time and effort synthesizing and analyzing
large numbers of compounds not in the neighborhood of the lead compound which would be
erroneously tried prior to knowledge of this invention.

Because the libraries of this invention have been constructed using two selection steps 
based on molecular structural differences, each step provides an opportunity to identify and 
explore compounds having similar structural features. 

A. Advantages Resulting From Product Filter

Due to the way the final product molecules were selected for inclusion in the library,
all compounds with a Tanimoto similarity of approximately 0.85 or greater to a compound
already in the library were excluded. Therefore, the first place to look for compounds likely
to have the same activity as the lead compound is in the group of all compounds in the
combinatorial universe from which the lead was identified having a Tanimoto coefficient with
respect to the lead compound of approximately 0.85 or greater. Then, since each of these
initial compounds will also have an associated group of different compounds within
approximately 0.85 Tanimoto similarity of themselves, this larger group forms the second layer
of what can be an expanding area of similar compounds to investigate. How far outwards from
the lead compound the search is carried (each time searching within a Tanimoto coefficient of
approximately 0.85) will be determined by the success of these additional compounds showing
activity in the same assay as the lead compound. Thus, the library design itself identifies and
permits a directed search for compounds from the utilized combinatorial universe most likely
to have activity similar to the lead compound. The same procedure is followed if another valid
metric (not the Tanimoto similarity) was used to create the library. Then all compounds within
the neighborhood distance to a compound already in the library were excluded and the first
place to look would be for compounds which fall within the neighborhood distance. The
process is exactly identical to that followed using the Tanimoto descriptor.
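
The layered search around a lead described above may be sketched as follows. This is a
minimal illustration under stated assumptions: the similarity function is supplied by the
caller, the function and parameter names are hypothetical, and the compound names and
similarity values in the toy table are invented solely to exercise the code.

```python
from typing import Callable, Hashable, List, Sequence, Set


def neighbor_layers(
    lead: Hashable,
    universe: Sequence[Hashable],
    similarity: Callable[[Hashable, Hashable], float],
    threshold: float = 0.85,
    max_layers: int = 2,
) -> List[Set[Hashable]]:
    """Layer 1 holds neighbors of the lead, layer 2 neighbors of layer 1, and so on."""
    seen = {lead}
    frontier = {lead}
    layers: List[Set[Hashable]] = []
    for _ in range(max_layers):
        layer = {
            c for c in universe
            if c not in seen
            and any(similarity(c, f) >= threshold for f in frontier)
        }
        if not layer:
            break
        layers.append(layer)
        seen |= layer
        frontier = layer
    return layers


# Invented pairwise similarities; any pair not listed is treated as dissimilar.
toy_pairs = {
    frozenset(("lead", "a")): 0.90,
    frozenset(("a", "b")): 0.88,
    frozenset(("lead", "b")): 0.60,
}


def toy_similarity(x: str, y: str) -> float:
    return toy_pairs.get(frozenset((x, y)), 0.20)


layers = neighbor_layers("lead", ["lead", "a", "b", "c"], toy_similarity)
```

Here compound "a" forms the first layer, "b" is reached only through "a" and forms the second
layer, and "c" is never reached.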

B. Advantages Resulting From Reactant Filter

Two consequences flow from the selection of only one reactant from each cluster. First,
combinatorial products containing that reactant may or may not be the most active with respect
to any particular given biological screening test. There is no way to guarantee that the reactant
that yields the most active product will be selected from the cluster. For any reasonably sized
cluster, the probabilities of finding the reactant that yields the most active product would not
be greatly increased even if two reactants from that cluster were chosen, and the size of the
library to be tested would have been doubled.

However, the second consequence of selecting only one reactant from each cluster
presents the flip side of the selection coin. Once a lead compound is identified, the library
design immediately indicates from which diverse clusters the reactant molecules were chosen.
All the other possible reactants (in the combinatorial chemical universe under study)
representing similar aspects of diversity are included in the clusters from which the reactants
were chosen. For lead optimization, compounds containing the other reactants from the
identified cluster(s) can be synthesized and tested. The library design itself assures that the
exploration of these reactants is likely to yield compounds with similar activity to the lead
compound. Thus the reactant selection process not only reduces the number of molecules that
need to be screened, but simultaneously identifies the molecular structures which should be
subsequently explored to find the compound with the highest activity similar to the identified
lead. No other prior art library design process provides so much information for lead
optimization.

C. Additional Optimization Methods Using Validated Metrics
The knowledge that a metric is valid, and what that implies for the metric space as
discussed earlier, immediately enables methods for lead optimization not previously possible.
In particular, knowing that a metric will define a design space where compounds with similar
biological properties are found measurably near each other (the definition of a valid metric)
now permits for the first time the quantitative examination of the array of molecules used in
any screening assay to determine whether any molecules are measurably close to the identified
lead compound. One aspect of this approach has already been discussed in sections 9.A and
9.B and certainly works best with an optimal library designed by the method of this invention.
In addition, however, validated metrics will permit useful examination of any assemblage of
compounds whether or not the lead compound is identified from within the assemblage. There
is no restriction on the source of the additional compounds to be examined and they may range
from prior art screening libraries to chemical databases. Once a lead is identified, a validated
metric would be used to map the lead and all other compounds in the assemblage to be
examined into the metric space; i.e., the metric characteristics/values are determined for all
possible compounds. For reactants (possible substituents) a metric validated on reactants would
be used. For whole molecules, a metric validated on whole molecules would be used. Metric
differences between the lead molecule and all the other molecules would then be calculated.
All molecules with metric distances to the lead within the neighborhood distance of the
validated metric should have similar biological activities. Again, if the metric distances from
each molecule thus identified as falling within the neighborhood distance of the lead are then
calculated with respect to all other molecules (excluding the lead and each other), a second
layer of molecules is identified which should have activity similar to the active neighbors of
the lead molecule. Additional layers may be similarly identified and explored experimentally.
Depending on the structures involved, at least two layers would normally be explored. Thus,
because validated metrics are now available, lead optimization will much less often be the hit
or miss procedure characteristic of the prior art.

An extension of this procedure yields yet another major advance. In the prior art it was 
not possible to tell how far away from the lead (in structural terms) one should explore in the 
search for a compound more active than the lead. In terms of the two dimensional activity 
island analogy of Figure 1, no procedure existed for exploring the shape or extent of the island
of activity. Without knowledge of the island's shape and extent, not only was it impossible to 
know by how far a compound missed the island, but even when an active compound was 
found it was also not possible to know if the island had been sufficiently explored; that is, 
whether all compounds representing the range of diversity spanned by the activity island had 
been identified. In other words, had everyplace been explored that should have been? 

With the molecules identified by the expansion procedure outlined above, it will now 
be possible to map the island. Starting with molecules within the neighborhood distance of the
lead, molecules would be synthesized and tested for activity. If all the molecules within the
neighborhood distance ("nearest neighbors") show activity, each still falls within the boundary 
of the island, and the next layer of molecules in the neighborhood distance expansion would 
be synthesized and tested. If only some of the nearest neighbor molecules show activity, the 
neighborhood radius of the lead must span an edge of the activity island, and only molecules 
falling within the neighborhood distance of these nearest neighbor active molecules would be 
included in the next layer of the expansion and synthesized and tested. Again, some of the 
newly tested molecules may show activity and some may not. This process of nearest neighbor 
molecule identification and testing should be repeated until no molecule in the next expansion 
layer shows any activity. The active molecules determined by this procedure will define the 
limits and shape of the activity island in terms of structural differences. 
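
The island mapping procedure described above may be sketched as follows. This is a minimal
illustration under stated assumptions: the is_active callable stands in for synthesis and
biological testing, the function and parameter names are hypothetical, and the toy universe
(integers on a line, with activity confined to the values 2 through 5) is invented purely to
exercise the code.

```python
from typing import Callable, Hashable, Sequence, Set, Tuple


def map_activity_island(
    lead: Hashable,
    universe: Sequence[Hashable],
    distance: Callable[[Hashable, Hashable], float],
    radius: float,
    is_active: Callable[[Hashable], bool],
) -> Tuple[Set[Hashable], Set[Hashable]]:
    """Expand outward from the lead, following only molecules that test active.

    Returns (active molecules, tested but inactive molecules).
    """
    active, tested, frontier = {lead}, {lead}, {lead}
    while frontier:
        # Next layer: untested molecules within the radius of an active molecule.
        layer = {
            c for c in universe
            if c not in tested
            and any(distance(c, f) <= radius for f in frontier)
        }
        tested |= layer
        frontier = {c for c in layer if is_active(c)}
        active |= frontier
    return active, tested - active


# Toy universe (hypothetical): the "island" covers the values 2 through 5.
active, inactive = map_activity_island(
    lead=3,
    universe=range(0, 9),
    distance=lambda a, b: abs(a - b),
    radius=1,
    is_active=lambda x: 2 <= x <= 5,
)
```

The expansion stops once a layer contains no active molecule. In the toy case the inactive
molecules 1 and 6 mark the edges of the island, and molecules 0, 7 and 8 are never tested.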

The resolution obtainable with this procedure depends upon how well the structural
diversity of the activity island is represented by the molecules in the original assemblage. That
is, if only a portion of the activity island structural diversity is represented in the assemblage
of molecules, that is the only part of the island which can be explored. Alternatively, perhaps
only the island's rough outline can be perceived. Within the constraints of the diversity present
in the assemblage, exploration of the full extent of the island and of the space within its
boundaries can be accomplished with the guidance of the validated metric with which the
island is mapped. To explore the island further it is only necessary to identify molecular
structures not included within the original assemblage with which to test the unknown territory.
In some cases, in order to distinguish particular structural differences, it may be necessary to
consider additional sources of structurally diverse molecules and, perhaps, to map the lead and
additional compounds in more than one metric space. Thus, possible structures can be
proposed and examined with the validated metric. If the proposed structures fall within the
neighborhood distance of an active molecule, they can be experimentally tested. If those are
active, further structures can be proposed and again examined to determine whether they fall
within the neighborhood distance of the newly identified active molecule. If they do, they
would be experimentally tested. Repeating this cycle of identification and testing will ultimately
yield a higher resolution map of the island and assure the searcher that the island has been
thoroughly explored and no activity peak has been missed.

The availability of validated metrics enables yet another method of rationally directed 
lead optimization from a knowledge of the structure of a lead molecule which was not 
identified from screening an optimally diverse combinatorial screening library. Essentially, the 
reactant screening process is utilized backwards to identify similar molecular structures, and 
then the product screening process is utilized to confirm structural similarity of proposed
products to the lead. Two cases are important. The first involves lead molecules which can be 
synthesized directly from reactants. In this method, the lead molecule would be analyzed to 
determine from what constituent reactants it may be synthesized. These reactants would then 
be characterized using a reactant metric such as topomeric CoMFA. Molecules in databases 
of potential reactants would be characterized using the reactant metric and searched for
reactants falling within the neighborhood radius of each of the original reactants. The identified 
reactants will provide a basis for building proposed products having the same structural 
characteristics (diversity) as the original lead compound. However, before the product is 
synthesized, its similarity in metric space to the lead would be checked using a product 
appropriate metric to make sure that it falls within the neighborhood radius of the lead.

The second case involves lead compounds in which substituent groups are bonded to 
a central or core molecule. The reactants which form the basis of the substituents as well as 
the core molecule would then be characterized using appropriate validated metrics. Again,
molecules in databases of possible reactants and core molecules would be characterized with
validated metrics and searched for molecules falling within the neighborhood radius of each
of the original reactants and core. The molecules thus identified would provide a basis for
building proposed products with structural diversity similar to the lead compound. Again,
before synthesis, the proposed products would be evaluated with an appropriate metric to
confirm that they fall within the neighborhood distance of the lead compound.
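
The two-step search described above (neighbors of each constituent first, then a whole
molecule check on each proposed product) may be sketched as follows. This is a minimal
illustration and not the method of this disclosure as implemented: all metrics and the product
builder are supplied by the caller, the function and parameter names are hypothetical, and in
the toy data plain integers stand in for reactants purely so the example can run.

```python
from itertools import product as cartesian
from typing import Callable, List, Sequence, Tuple


def propose_analogs(
    lead_reactants: Sequence,
    reactant_db: Sequence,
    reactant_distance: Callable,
    reactant_radius: float,
    build_product: Callable[[Tuple], object],
    product_distance: Callable,
    product_radius: float,
) -> List:
    """Propose products built from reactant neighbors, kept only if near the lead."""
    lead_product = build_product(tuple(lead_reactants))
    # Step 1: for each original reactant, collect database reactants in its neighborhood.
    neighbor_lists = [
        [r for r in reactant_db if reactant_distance(r, original) <= reactant_radius]
        for original in lead_reactants
    ]
    # Step 2: build each combination and confirm it lies in the lead's neighborhood.
    proposals = []
    for combo in cartesian(*neighbor_lists):
        candidate = build_product(combo)
        if candidate == lead_product:
            continue
        if product_distance(candidate, lead_product) <= product_radius:
            proposals.append(candidate)
    return proposals


# Toy stand-ins (hypothetical): integers as reactants, tuples as products.
proposals = propose_analogs(
    lead_reactants=(10, 20),
    reactant_db=[8, 9, 10, 11, 19, 20, 22, 30],
    reactant_distance=lambda a, b: abs(a - b),
    reactant_radius=1,
    build_product=lambda combo: tuple(combo),
    product_distance=lambda p, q: sum(abs(x - y) for x, y in zip(p, q)),
    product_radius=1,
)
```

The same sketch covers the case of a core with substituents if the core is treated as one more
entry in lead_reactants, with a metric appropriate to cores.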

Since it is known that molecules resulting from different chemistries and involving
different constituents often show activity in the same biological assay, it would be desirable
to search as wide a range of molecules as possible when performing the searches outlined
above to identify additional molecules that are within the neighborhood distance of some lead
compound. Clearly, when contemplating these procedures, it must be recognized that the
universe of all accessible chemical substances, even under the constraints of molecular weight
that characterize a useful drug, numbers trillions of structures. While such unprecedented
directed searches are only now possible with validated metrics, until the discovery and creation
of the virtual library discussed later, even with today's powerful computers, the practicality
of such large searches depended on preorganizing the trillions of candidate structures in such
a way that the vast majority of candidates could be excluded, to the greatest extent possible,
at the start of the search.

For instance, one such useful preorganization involves dividing the candidates into
series of molecules accessible by some common synthetic route, and thus [...]
of a core and reactants. (Typically, the synthetic route used to create the lead would be the
first investigated and other sets of alternative routes explored secondarily.) A combinatorial
SYBYL Line Notation (cSLN) affords a useful description of such a series of molecules.

Molecules represented by a cSLN would be considered for overall similarity to an
active lead molecule in the manner discussed above. Using validated metrics, it is most
efficient to: 1) first identify each of the individual lists of reactants within the cSLN with the
most similar side chain within the active lead; 2) next, to consider the similarity of the "core"
within the lead (the atoms remaining after the side chains are identified) to the non-variant core
within the cSLN; and 3) then, if the "core" similarity is not so low that this series of
molecules can immediately be excluded, to order the variation lists by similarity to the
corresponding side chains within the lead. The advantage of such a partitioning and
preordering by similarity is the ability to break off the search as soon as no remaining member
of the series would be likely to be sufficiently similar.

As an overly simplistic example, consider the series of sixteen possible dihalogenated
methanes which may be represented by a cSLN as: X2CH2X1{X1:F|Cl|Br|I}{X2:F|Cl|Br|I}.
If bromobenzene were the "active lead" and the dihalomethanes were the
series to be considered, an appropriate metric that indicated the lack of similarity of the
aromatic core of bromobenzene to the methylene core of the dihalomethanes would
immediately eliminate all dihalomethanes without considering each of the sixteen individual
possibilities. However, if ethyl bromide were the "active lead", an appropriate metric might
show that the methylene and ethylene moieties were sufficiently similar to warrant
consideration of the individual methylene dihalides, and preordering of the variation list might
immediately lead to dibromomethane as the most similar dihalomethane to ethyl bromide (the
first bromine atom being identical to the ethyl bromide bromine, and the second bromine atom
probably being the most similar to the CH3 of the ethyl bromide). In this hypothetical example
only one molecule instead of sixteen would need to be considered in identifying similar
molecules most likely to lie within the same neighborhood as the lead. Within actual cSLNs
(each possibly representing perhaps millions of structures by including more points of variation
and many more and larger variations at each point), the speed enhancement obtainable from
this searching strategy would be many orders of magnitude greater than sixteen.
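
The dihalomethane example can be sketched in Python. This is only an illustration under stated assumptions, not the method as implemented: the `Series` class, the `search` function, the `core_cutoff` and `top_n` parameters, and every similarity score in `CORE_SIM` and `SIDE_SIM` are hypothetical stand-ins for values a validated metric would supply. The pairing of each variation list with a lead side chain (step 1 of the procedure) is assumed to be given.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Series:
    """A cSLN-like series: one invariant core plus a variation list per site."""
    core: str
    variations: list

# Hypothetical similarity scores in [0, 1]; a validated metric would supply these.
CORE_SIM = {("CH2", "CH2"): 1.0, ("CH2", "phenyl"): 0.1}
SIDE_SIM = {
    ("F", "Br"): 0.4, ("Cl", "Br"): 0.7, ("Br", "Br"): 1.0, ("I", "Br"): 0.8,
    ("F", "CH3"): 0.3, ("Cl", "CH3"): 0.5, ("Br", "CH3"): 0.8, ("I", "CH3"): 0.6,
}

def search(series, lead_core, lead_sides, core_cutoff=0.5, top_n=1):
    """Return the candidate products worth examining for one series."""
    # Reject the whole series if its core is too unlike the lead's core.
    if CORE_SIM.get((series.core, lead_core), 0.0) < core_cutoff:
        return []
    # Order each variation list by similarity to the lead's side chain at the
    # matching site, and keep only the best top_n entries.
    shortlists = []
    for site, lead_side in zip(series.variations, lead_sides):
        ranked = sorted(site, key=lambda v: SIDE_SIM.get((v, lead_side), 0.0),
                        reverse=True)
        shortlists.append(ranked[:top_n])
    # Only combinations of shortlisted variations are enumerated.
    return list(product(*shortlists))

halogens = ["F", "Cl", "Br", "I"]
dihalomethanes = Series(core="CH2", variations=[halogens, halogens])

print(search(dihalomethanes, "phenyl", ["Br", "H"]))   # bromobenzene as lead
print(search(dihalomethanes, "CH2", ["Br", "CH3"]))    # ethyl bromide as lead
```

With these assumed scores, the bromobenzene lead excludes the series at the core test, and the ethyl bromide lead yields the single candidate dibromomethane. Truncating each ordered list at `top_n` is a simplification of breaking off the search once no remaining member is likely to be sufficiently similar.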

There may be other variations or applications of the methods outlined above which
are not yet recognized at the present time since the concepts and applications of this invention
are still so new. However, reasonable extrapolations/techniques of molecular discovery which
follow from the disclosure of the present invention and, in particular, from the ability to
validate metrics, are considered within the teaching of this application.

10. Merging Libraries

The final selection (sampling) methodology of this invention has broader uses than yet
described. So far, this disclosure has been primarily concerned with the design of a
combinatorial screening library based upon either sets of reactants or sets of reactants and
central cores. Each combinatorial screening library based on these materials only explores the
diversity of that part of the chemical universe accessible with those compounds. Unless as
much of the diversity of the entire combinatorially accessible chemical universe is explored
in a screening library as is possible, there is no assurance that a molecule possessing activity
with respect to any particular unknown biological assay will be found. Clearly, the useful
diversity of the combinatorially accessible chemical universe can only be explored with as
many sets of reactants attached to as many cores as is possible. Stated slightly differently,
there may be large parts of the diversity of the chemical universe not explored by one or even
a few combinatorial schemes. Thus, combinatorial screening libraries based on multiple
reactants and multiple cores would be desirable. Just such libraries can now be created through
the use of the virtual library discussed later. However, even with screening libraries
constructed with the method of this invention discussed above, the simple addition to each
other of many such libraries will quickly increase the total number of molecules which need
to be screened. Worse yet, since many of the possible reactants used for combinatorial
synthesis with different cores have similar structures, and since many of the possible cores
used for combinatorial synthesis may differ little from each other, it is highly likely that much
of the same diversity is represented to a greater or lesser extent in each of the libraries
generated from these materials. Simply combining the libraries would again result in
oversampling of the same diversity space. It would clearly be more useful and economical
(efficient) in terms of time, money, and opportunity to use additional screening to explore
different aspects of the diversity of the chemical universe.

Another significant feature of this invention is the recognition that the neighborhood
selection (sampling) criteria also provide a method to combine combinatorial screening
libraries to avoid this oversampling problem. Starting with an arbitrary first library, using a
validated metric which can be applied to whole molecules, each molecule of a second library
is added to the first library if the molecule does not fall within the neighborhood radius of any
molecule in the first library as supplemented by all the added molecules from the second
library. This process is continued until all the molecules in the second library have been
examined. In this manner, only molecules representative of a different aspect of diversity are
added from the second library to the first. Each successive library is added in the same
manner. The molecules in a final combined library formed from smaller libraries selected
according to the method of this invention represent diverse molecular compounds and have the
optimal diversity which is desired of a general combinatorial screening library. However, even
if the groups of molecules to be merged have not been selected by the methods of this
invention, they may be merged according to the above procedure if first, a subset of each
group of molecules is selected according to the product sampling method of the design process.
This will insure that similar molecules within each group are eliminated. The resulting merged
library will not be optimally diverse, but it should not redundantly sample the diversity present
in the separate groups.

The 2D Tanimoto fingerprint metric is useful in performing the library additions. The
2D Tanimoto similarity coefficients of each molecule in the first library to all molecules in a
subsequent library are calculated. Each molecule of the second library is added to the first
library if the molecule does not fall within a 0.85 Tanimoto coefficient (the neighborhood
radius) of any molecule in the first library as supplemented by all the added molecules from
the second library. As long as the metric used for sampling and end-point determination is
valid (has the neighborhood property), this selection method guarantees a combined library in
which all of the accessible diversity space is represented with little likelihood of oversampling.
An example of three prior art libraries not designed with the method of this invention which
might be merged using the neighborhood sampling criteria is shown in Figure 15. Figure 15
shows the distribution of molecules plotted according to their Tanimoto 2D pairwise similarity
of the Chapman & Hall Dictionary of Natural Products, Dictionary of Pharmacological Agents,
and Dictionary of Organic Compounds (CD ROM Versions). It is immediately clear from
Figure 15 that simply adding the three libraries together would produce a combined library in
which most of the compounds would be very similar to each other (Tanimoto similarities
> 0.85). Further redundant similarity would be expected from a comparison of the similarities
between the molecules in the three libraries! The position of the 0.85 similarity point to the
bulk of the molecules in each library indicates that most of the molecules in these databases
would be excluded from a combined library formed by merging the databases by the procedure
outlined above.
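
The merging procedure can be sketched as follows. This is a minimal illustration and not the implementation used for Figure 15: fingerprints are represented as sets of on-bit indices, the function names `tanimoto` and `merge_libraries` are hypothetical, and treating a similarity of exactly 0.85 as falling inside the neighborhood is an assumption.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def merge_libraries(libraries, radius=0.85):
    """Greedily merge libraries, skipping molecules inside an existing neighborhood."""
    merged = list(libraries[0])  # the arbitrary first library is taken as-is
    for library in libraries[1:]:
        for fp in library:
            # Compare against the first library plus everything added so far.
            if all(tanimoto(fp, kept) < radius for kept in merged):
                merged.append(fp)
    return merged

first = [frozenset(range(0, 20))]
second = [
    frozenset(range(0, 19)),   # 0.95 similar to the first library: skipped
    frozenset(range(10, 30)),  # about 0.33 similar: added
    frozenset(range(10, 31)),  # about 0.95 similar to the molecule just added: skipped
]
merged = merge_libraries([first, second])
print(len(merged))  # 2
```

Because each candidate is compared against the growing merged list, the result depends on the order in which libraries and molecules are presented, which is consistent with the text's description of starting from an arbitrary first library.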

11. Other Advantages of Optimally Diverse Libraries

There are additional benefits achieved by designing combinatorial libraries according
to the method of this invention. For instance, as noted earlier, one of the difficulties of
screening several compounds simultaneously is the possibility of non-specific activity being
detected due to the contributory effect of the combination of compounds. In fact, the likelihood
of this effect is increased when compounds of the same molecular structural and chemical
diversity are tested in the same assay. With the libraries of this invention, it will be possible
to design the assay combinations so that only compounds representing different aspects of
diversity are tested together. While this procedure can not guarantee that no combination
effects will occur, it makes it much less likely. Another benefit achieved is that complex
deconvolutions will generally be unnecessary. Deconvolution problems are accepted in the
prior art as a necessary evil due to the enormous number of molecules which must be
synthesized and screened since virtually all combinatorial possibilities are included in the
libraries. Clearly, with smaller optimally diverse combinatorial screening libraries covering
the same search territory as the larger prior art libraries, it is possible with the aid of computer
controlled robots and data bases to individually synthesize and track each compound.

As mentioned at the beginning of this disclosure, the methods of this invention are also
applicable to problems outside the specific area of drug research. The notion of choosing
compounds based on diversity is a general concept with many applications and is applicable
any time the problem is presented of having more compounds than can usefully be tested/used.
The example was given earlier of determining what compounds had the same structural
diversity as a previously identified (biologically active) compound. Of course, with the
methods of this invention, the activity may be any chemical activity. In addition, the universe
of chemicals from which only some are to be selected does not have to result from a
combinatorial synthesis, but may result from any synthesis or no synthesis at all. An example
of the latter would be the solution to the question of selecting molecules of similar diversity
from among those in a large corporate or catalog data base. In these cases, an appropriate
metric (remembering that different metrics are applicable in different circumstances) would be
applied to all the compounds and clustering would result in compounds of the same diversity.
The methods of this invention, including metric validation, topomeric CoMFA metric
characterization, end-point neighborhood sampling, lead compound optimization, and library
design can all be applied separately and together to solve the selection problem.

12. Virtual Library Construction & Searching

The two step sequential design process for selecting optimally diverse product molecule
libraries set out so far in this application is necessarily computationally time consuming,
limited to consideration of one set of synthetic reactions at a time, and eliminates at the first
stage reactants which might be capable of generating products which would pass the product
stage neighborhood filtering criteria. The process is computationally time consuming since, for
any given set of reactants, the steric metric must first be computed, the resulting descriptors
clustered and a selection of reactants made based on the neighborhood rule. Only after this
first stage can the possible product molecules be determined, a second product metric
calculated, and selection made of the final library members.

The process is limited to one set of synthetic reactions at a time in the following sense.
First a particular organic chemical reaction scheme is identified as well as the core and
possible reactants which may be used in the scheme. Each sequential step of library design is
sequentially implemented and results in an optimally diverse library for that reaction. For a
slightly different core which involves the same chemical reaction scheme and the same
reactants, the entire process including all calculations must be repeated. Each combination of
core and reactants generates a different library. In the method of the above referenced patent
application, the resulting libraries, individually derived, are then combined. This process also
adds additional time to the assemblage of a larger optimally diverse library. Finally, the
product stage of the design is constrained by the reactant stage; that is, since it is desirable to
generate as many diverse products as possible, some products may be sufficiently diverse (as
confirmed by the product neighborhood metric) when created from similar reactants (those
falling within a topomeric neighborhood cluster) by virtue of the mere combination of the
reactants into the products, and such products should be included in the library.

In addition, consideration of the above techniques of optimally diverse library design,
lead optimization, and merging libraries all point to the distinct advantages of being able to
explore the diversity of combinatorially accessible chemical universes using/including as many
reactions, cores, and reactants as possible. Thus, it was recognized that, ideally, library design
and lead optimization would be most useful if all combinatorially accessible molecules could
be meaningfully searched. The sheer number of molecules involved (trillions) would seem to
suggest that even with today's fastest computers, such a library design and searching would
be unachievable. However, using the power and utility of validated metrics, a way to create
and search a data base containing representations of products from as many combinatorial
reactions and reactants as desired (a huge combinatorially accessible universe) has been
discovered. This data base is essentially a virtual library of combinatorial products because,
as will be explained below, all information necessary and sufficient to search across and
construct all possible product molecules is contained within the virtual library even though the
structure of each combinatorial product is not explicitly contained within the virtual library.

The virtual library can be used not only to select screening libraries, to find molecules
with similar structures to a lead compound, to perform lead explosions, but, through the use
of validated metrics, it can also be used to search for and select compounds likely to have
similar biological or other physical properties from across the broader chemical universe. In
fact, as will be seen below, use of the virtual library opens up possibilities for searching the
accessible chemical universe in ways not heretofore possible.

With respect to the selection of screening libraries, it has been discovered that the same
approach to design as previously described can be performed more efficiently and more exactly
by combining the formerly separate steps of topomeric selection of reagents and Tanimoto
selection of products into one step which operates on the entire set of all possible products
from the reaction under consideration. Another advantage of this approach is that generally a
larger group of diverse compounds is identified; that is, the significant (active) metric space
is sampled more extensively. Additionally, the method by which the maximally diverse set is
selected can be modified to yield results which more readily suit the practical issues of
laboratory synthesis. As a consequence of this discovery, an efficient method for identifying
molecules of interest from the billions of possible products obtainable from combinatorial
syntheses has been discovered. Indeed, use of the virtual library is not limited to finding
molecules derivable from known synthetic combinatorial reactions, but is generally applicable
to molecular selection. As with the selection methodology discussed above, the ability to create
and search the virtual library relies upon the power of the neighborhood property of validated
metrics to distinguish the similarity or dissimilarity of molecular properties between molecules.

The creation of a virtual library using validated molecular descriptors enables methods
to identify compounds of interest from many possible compounds and is particularly applicable
to identifying compounds of interest from extraordinarily large numbers of compounds. The
application of these novel methods speeds the searching operation and in some ways extends
the types of searching criteria which may be used. Most importantly, construction of a virtual
library makes it possible to identify compounds of interest by an exhaustive search through all
possible compounds from a series of known synthetic reactions - thus providing a capability
which does not currently exist otherwise. In particular, the virtual library provides a large
number and variety of ways to select a subset of compounds from a very large number of
compounds. The number of compounds from which to make the selection is likely to range in
the trillions of compounds, based only on known synthetic reactions and commercially
available reagents appropriate for each reaction.

The following disclosure of the method of constructing and searching a virtual library
will be discussed with respect to those compounds accessible through combinatorial syntheses.
However, as noted above, the virtual library is not limited to such combinatorial compound
universes and these universes are disclosed by way of an example of the methodology of the
discovery, not a limitation thereof.

The significant aspect of being able to create a virtual library using validated metrics
is the ability to identify from the large universe of compounds those with related properties
and/or structural characteristics without having to examine individual structures; in other
words, to do structural searches without directly comparing (looking at) structures. This is
made possible by precalculating, as much as possible, characteristics for the component parts
of the product structures. Clearly, then, the beginning point for this method is the construction
of a database, or "virtual library", of possible chemical compounds, products, which can be
synthesized from a common reaction.

A. Derivation of the Database (Virtual Library) of Compounds
The database of compounds, "virtual library", to which the method of this invention
may be applied is an assembly of the combinatorially derived product structures resulting from
any number of synthetic reactions. In initial applications tens of reactions are used to construct
the database (virtual library) of interest. The total number of possible product compounds
becomes astronomically large very quickly. For instance, there are approximately 500
commercially available molecules having reactive diamino groups and approximately 15,000
commercially available reactants which will react independently with each of the amino groups.
Combinatorially there can therefore be generated 15,000 X 15,000 X 500 (approximately 112
billion) possible product molecules from this one reaction scheme alone.
B. Overview of Methodology
A fundamental part of the discovery of how to create and use a virtual library is a
method to precompute properties based on 1 + N1 + N2 + N3 + ... + NM structural variations
which can be used to exactly, or with useful degree of approximation, predict the 1 X N1 X N2
X N3 X ... X NM product structure properties which arise from all combinations of the structural
variations about the 1 core at all M substitution sites. In the earlier part of this disclosure, the
variable parts of a combinatorially derived molecule were referred to either by reference to
their source (reactants) or their molecular configuration when attached to the core (side
chains). When discussing creation and searching of a virtual library, the more generic term
"structural variations" is appropriate for the groups appended to a core. The reasons for
adopting this term will become clear later during the discussion of searching the virtual library
with respect to non-combinatorially derived structures.
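
The gap between what must be precomputed and what can then be described is easy to check numerically. The sketch below uses the diamine figures quoted earlier; treating the 500 diamines as a third list of structural variations alongside a single formal core is an assumption made only to fit the 1 + N1 + ... + NM form, and the variable names are illustrative.

```python
from math import prod

# Sizes of the variation lists: two reactant sites and the diamine list.
site_sizes = [15_000, 15_000, 500]

# Additive work: the core plus each structural variation, computed once.
descriptors_to_precompute = 1 + sum(site_sizes)

# Multiplicative coverage: every combination of one choice per site.
products_described = prod(site_sizes)

print(descriptors_to_precompute)  # 30501
print(products_described)         # 112500000000
```

Roughly thirty thousand precomputed entries therefore stand in for more than one hundred billion product structures.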

Figure 16 shows in schematic form a representation of three structural variations
attached to a central core. In Figure 16, each possible product structure arises from combining
the core substructure with exactly one of the N1 choices in the set of structural variations {R1},
exactly one of the N2 structural variations in the set {R2}, etc.
For many properties, such as molecular weight and price, or count of rotatable bonds,
or number of H-bond donors and acceptors, the values associated with the product compound
are exactly the sum of the appropriately created structural variations.
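
Additive properties can be illustrated with a short sketch. The `Fragment` class, the `product_properties` function, and all numeric values are hypothetical placeholders rather than real chemical data; the sketch assumes each fragment's contribution has already been adjusted for the atoms gained or lost on attachment, which is what makes the sum exact.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fragment:
    name: str
    mw: float              # molecular weight contribution
    rotatable_bonds: int   # rotatable bond contribution
    h_donors: int          # H-bond donor contribution

def product_properties(core, variations):
    """Sum per-fragment contributions to get the product's additive properties."""
    parts = [core, *variations]
    return {
        "mw": sum(p.mw for p in parts),
        "rotatable_bonds": sum(p.rotatable_bonds for p in parts),
        "h_donors": sum(p.h_donors for p in parts),
    }

# Placeholder values chosen only to make the arithmetic easy to follow.
core = Fragment("core", 100.0, 1, 1)
r1 = Fragment("R1 choice", 50.5, 2, 0)
r2 = Fragment("R2 choice", 30.25, 0, 1)

props = product_properties(core, [r1, r2])
print(props)  # {'mw': 180.75, 'rotatable_bonds': 3, 'h_donors': 2}
```

No product structure is built at any point; only the stored per-fragment values are combined.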

For some properties, such as logP, the assumption of additivity is inexact but adequate
for the purpose of selecting a small subset from a very large number of possible products.
For other properties, particularly the topomeric shape descriptor, the comparison of two
product compounds' properties requires a decision on how to match each structural variation's
descriptor in the first product to one structural variation's descriptor in the second product
such that each structural variation is referenced exactly once.
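
One way to make the matching decision concrete is to try every one-to-one pairing of sites and keep the best. The sketch below is an assumption-laden illustration: the text states only that a matching must be chosen, so taking the minimum over pairings, combining per-site distances as a root sum of squares, and using short numeric tuples as descriptors are all choices made here for demonstration.

```python
import math
from itertools import permutations

def site_distance(a, b):
    """Euclidean distance between two per-site descriptor vectors."""
    return math.dist(a, b)

def product_distance(prod_a, prod_b):
    """Smallest combined distance over all one-to-one pairings of sites.

    Each structural variation in prod_a is paired with exactly one in prod_b.
    """
    best = math.inf
    for perm in permutations(range(len(prod_b))):
        total = math.sqrt(sum(site_distance(prod_a[i], prod_b[j]) ** 2
                              for i, j in enumerate(perm)))
        best = min(best, total)
    return best

# The same two variations listed in opposite site order.
product_a = [(0.0, 0.0), (3.0, 4.0)]
product_b = [(3.0, 4.0), (0.0, 0.0)]

print(product_distance(product_a, product_b))  # 0.0
```

Enumerating permutations is practical only for the small number of substitution sites typical of a combinatorial scheme; with many sites an assignment algorithm would be the usual substitute.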

There are also some properties (such as molecular fingerprints) which are representative
of the whole combinatorial product molecule and can not be represented by the sum of the
constituent structural variations. The method for deriving these properties will be discussed
below. Generally, however, by this method a virtual library containing descriptions of the
structures of all possible combinatorially generated products can be created from a knowledge
of the properties of the structural variations.

C. Overview of Virtual Library Construction 

Initially information on the reactions to be included and the reagents which may be used
with those reactions needs to be gathered and entered. In addition, the reagents need to be
converted to their corresponding structural variations. The overall process of virtual library
construction is summarized in the flowchart of Figure 17. The first step in the creation of the
virtual library is to create for each possible structural variation (variable part) a file containing
various parameters/characteristics associated with that structural variation. Typically the file
may contain information on the price, source, availability, MW, and logP. In addition, the
metric characteristics for the structural variation resulting from the application of validated
metrics to the structural variation structure are included in the file. Other characteristics which
might be used for searching may be added to the file. Similar files are created for core
structures. As with the earlier discussion of designing optimally diverse libraries, any validated
metric may be chosen to characterize the structural variations or cores. For purposes of
discussion of the virtual library, the same metrics, topomeric CoMFA and Tanimoto
fingerprints, will be used as in the examples earlier.

The second step in creation of the virtual library is a description of the chemical
transformation represented by the chosen chemistry. The virtual library is then created by
combinatorially combining all structural variations in the chemical transformation to generate
virtual library descriptions of all possible product molecules.

Substantial effort is required to produce the representation of the structural variations
forming the database from a given reaction. The software provided as Appendix "E" and
Appendix "F" to this application is used in conjunction with the commercial software products,
Selector and Legion, to compute properties of the structural variations and to combine two or
more such lists of structural variations along with a core structure to produce the representation
of all possible products.

Particular skill is required to convert the chemist's description of reaction conditions
and reaction validation into a set of selection criteria applied to a database of available
reagents, by which only those reagents which are actually likely to yield the desired product
in the specific reaction conditions are included. (Here "reagents" refers to chemical starting
materials which undergo reaction to produce the products. A reagent corresponds to a molecule
used in a structural variation in the method, after some rearrangement of bonds.) Additionally,
methods for automating chemical judgment to derive the list of reagents and to compute the
properties such as the topomeric shape descriptor have been developed. Finally, a key concept
in constructing the virtual library is to organize the process of library definition so that it
depends on a relatively small number of parameters which can be stored in a table so that each
row in the table defines all the information that is necessary to specify a combinatorial library.
While the following discussion addresses formation of the virtual library in terms of chemical
transformations, cores, and reagents and/or structural variations which may be used, it should
be appreciated that data in the virtual library may be generated by any cores and structural
variations as long as the resulting compounds can be described by a cSLN. Thus, even product
molecules which can not be synthesized by a known combinatorial reaction can be included
in the virtual library and their structures searched.

D. Virtual Library Construction
The first phase of construction of a combinatorial library to be included in the virtual library
takes as input a description of the chemical transformation represented by that combinatorial
library and a list of available reagents and produces as output all the part structures (a/k/a
structural variations) found in the list of available reagents which are appropriate for the
chemical transformation, along with all structure-invariant physicochemical properties of those
fragments that might be useful in different types of subclass (subset) searches. As is apparent
from the earlier discussion, the same general and biologically based elimination criteria can
be applied to the proposed structural variations before selection of the structural variations for
inclusion in the virtual library. Alternatively, structural variations which would be eliminated
by the general or biologically based criteria can be flagged but still included. Having the
structural variations flagged, few potential product structures are eliminated from the virtual
library, but the products containing particular types of undesirable structural variations can still
be removed during selection.

In the course of this process, data are entered and recorded permanently into three
tables:

REACTIONS (a Molecular Spreadsheet) = information about a reaction scheme. Each
    record corresponds to a reaction. A typical reaction would be: "reaction
    of each nitrogen of a diamine with various reagents such as acids
    (acylation) or ketones (reductive amination)".
REAGENTS (a Molecular Spreadsheet) = information about a particular set of
    reagents used in some instance of a reaction. Each record corresponds
    to a particular logical reagent structure search in a database of such
    reagents, presumably a set of reagent structures which will all react in
    the same way. For example, there are sixteen reagent records for the
    diamine reaction, enumerating each of eight reactant classes that might
    react with each of the two nitrogens. One record for example describes
    a reaction with epoxides, that could be ring opened nucleophilically
    (and regioselectively) by an amine to yield a beta-amino alcohol.
RDATA (an Oracle Table) = invariant physicochemical data computed about
    structural variations, typically the varying portions in a cSLN, with one
    record for each structural variation encountered in any cSLN
    constructed. Thus data need not be recomputed when such structural
    variations are reencountered, a substantial savings in processing time.
    For example, records will be added describing the properties of a
    -CH2CH(OH)R chain (structural variation) for each (new) epoxide-R
    reagent retrieved by the example record just given for the REAGENTS
    spreadsheet.

Entering a new reaction into the system involves inputting the data for a new row to 
REACTIONS and at least two new rows to REAGENTS. This data entry operation is the only 
required data entry in preparation for virtual library production. 
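
The relationship among the three tables can be sketched with plain Python objects. This is a hypothetical model for illustration only and does not reflect the actual spreadsheet or Oracle schemas: the class and field names, the placeholder class labels `class_1` through `class_8`, and the stand-in property function are all assumptions. It shows the two points made in the text: one REACTIONS row plus its REAGENTS rows is the only data entry, and RDATA avoids recomputing a structural variation that has been seen before.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReactionRow:
    name: str
    description: str

@dataclass(frozen=True)
class ReagentRow:
    reaction: str       # name of the ReactionRow this search belongs to
    site: int           # which substitution site the reagents attach to
    reagent_class: str  # one logical reagent structure search

class RData:
    """Per-variation property store that computes each entry only once."""
    def __init__(self, compute):
        self._compute = compute
        self._rows = {}
        self.computations = 0

    def get(self, variation):
        if variation not in self._rows:
            self._rows[variation] = self._compute(variation)
            self.computations += 1
        return self._rows[variation]

reaction = ReactionRow("diamine", "reaction of each nitrogen of a diamine")
# Eight reactant classes at each of two nitrogens gives sixteen records.
reagent_rows = [ReagentRow(reaction.name, site, f"class_{k}")
                for site in (1, 2) for k in range(1, 9)]

# Stand-in property function; real entries would hold physicochemical data.
rdata = RData(lambda variation: {"n_chars": len(variation)})
rdata.get("-CH2CH(OH)CH3")
rdata.get("-CH2CH(OH)CH3")   # reencountered: served from the store

print(len(reagent_rows), rdata.computations)  # 16 1
```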

All these operations of table preparation are carried out by the SPL script getacd.core
(Appendix E) and executed within the commercially available software product SYBYL. The
code for producing the topomeric CoMFA field descriptor of each structural variation is
provided as Appendix F, CTOPS.

Representation of the Database of Compounds
The virtual library database of compounds for any one synthetic reaction is represented
as a set of chemically bonded (connected) structural variations where the connecting elements
may consist of a common core (one or more atoms which are identified in all members of the
set). More than two variable sites may be involved. The list of structural alternatives therefore
contains two or more elements, each of which represents a specific molecular fragment and
a number of associated molecular properties. Table 6 and Table 7 below are produced by
getacd.core. For each combinatorial scheme a set of files is generated. For a di-substitution
scheme the first file defines the combinatorial scheme, and the second and third files describe
the structural variations which can be utilized at the two sites. For a tri-substituted scheme,
there will be a set of four files: the first defining file, and three additional files describing the
structural variations for each of the three sites. The number of files in each set of files is
clearly determined by the combinatorial scheme involved.

In Table 6, the information following #@CORE describes the core, the information
following #@CONNECTOR describes the location of attachment of each of the two varying
sites and the #@QUERY line shows an example of how the list of structural variations may
be specified. Essentially this QUERY describes how to combinatorially construct product
molecules out of the structural variations and is used after searching of the data base is
complete to generate actual product structures.

TABLE 6
Sample cSLN File

#SYBYL/3DB HITLIST
# Created: Date Time
#@CLASS STRLIST
#@DATABASE NONE
#@SOURCE VDB_BUILDER
#@SUPPLIER
#@PRICE
#@FCD
85.062
#@LOGP -1.05
#@CORE X1C(=O)CH2NHC(=O)X2
#@CONNECTOR 1,X1=2;11,X2=9
#@QUERY Y_01C(=O)CH2NHC(=O)Y_02{Y_02:FC(F)(F)C[5]:C
{Y_01:FC(F)(F)C[5]:CH:C(:CH:C(:CH:@5)OCH3)NH<V=19>}

Application of a First Metric (Topomeric CoMFA)
Table 7 shows the format in which the structural variations for the first variable site
are listed, including both the structure in Sybyl Line Notation (SLN) and a set of related
properties such as SUPPLIER, PRICE, molecular weight MW, estimate of hydrophobicity
LOGP and a field, CTOPS, which in encoded form represents the novel shape descriptor, the
topomeric field (the steric field of the topomeric conformation) for the corresponding structural
variation. Information on only two possible structural variations is shown. For the diamino
example above, this structural variation file would contain all of the structural variations which
react with an amino group, approximately 15,000 entries.

TABLE 7
Structural Variations At First Site

[Table contents not legible in the source.]

A second file similar in appearance to that of Table 7 which lists all the structural 
variations which may occur at the second site is also created. 

Application of A Second Metric (Tanimoto Fingerprint) 
The overall process of applying the Tanimoto fingerprint metric for use in the 
virtual library is summarized in the flowchart of Figures 18, 19, and 20. As mentioned 
above, certain properties (molecular descriptors) of the product molecules cannot be 
simply computed as the sum of the associated properties of the substructures used to form 
the product molecule. One of the most important and challenging to compute of these 
molecular descriptors is the molecular fingerprint. This product descriptor cannot be 
calculated as the simple additive result of the descriptors of its pieces. For fingerprints, any 
fragment which is not fully contained within the core alone or within one structural 
variation alone will not be represented by treating each piece separately. Therefore, a 
fingerprint descriptor is computed for an extended core consisting of the structural variation 
at site Ri and including the substructures which consist of: 
1) the structural variation; 

2) the common core substructure; and 

3) all invariant atoms contiguously connected to the core occurring in structural 
variations at sites other than Ri. 

This process is repeated for all sites. 
Thus in Figure 16, if each selection in (R2) includes an OCH2 group connected to 
the core and each selection in (R3) contains a CH connected to the core, the fingerprints 
corresponding to a selection from (R1) will describe the substructure formed by this 
selection connected to the core and also including an OCH2 connected to the core at site 2 
and a CH connected to the core at site 3. 
For the standard definition of 2D fingerprints, this method can yield an exact result 
of the product fingerprint whenever the shortest connected path through the extended core 
is 5 atoms or more by OR-ing (a Boolean algebra manipulation) the fingerprints of each of 
the 3 structural variations in the example above. There is no need to include a separate 
fingerprint for the core, since it is contained in all the structural alternative descriptors. 
There is no hazard of duplication, since a fingerprint with a few exceptions notes only the 
presence of a connected fragment, not the number of occurrences. That is, either a bit is 
set in the fingerprint for that structure or it is not set. Duplicate occurrences of the same 
structure cannot set the bit twice. In the few cases, such as ring and halogen structural 
features, where a count is maintained, correction for these bits of the fingerprint may be 
accomplished by explicit correction by count of structural variations plus core. 
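
The OR-ing step described above can be illustrated with a minimal sketch. It assumes that 
each extended-core fingerprint is held as a Python integer used as a bit set; the function 
name and the 8-bit toy values are hypothetical and are not taken from the appendix code. 
Count-valued bits (ring and halogen features) are not handled here. 

```python
def estimate_product_fingerprint(site_fingerprints):
    """OR together the extended-core fingerprints of the chosen variations.

    Each fingerprint is an int used as a bit set (an assumption of this
    sketch). A bit is either set or not set, so duplicates cannot count twice.
    """
    product = 0
    for fp in site_fingerprints:
        product |= fp
    return product


# Hypothetical 8-bit fingerprints for one selection at each of three sites.
r1 = 0b0010_1101
r2 = 0b0110_0101
r3 = 0b1000_0100

estimate = estimate_product_fingerprint([r1, r2, r3])
print(f"{estimate:08b}")  # prints 11101101
```

Because OR is idempotent, the core bits shared by all three extended-core fingerprints 
appear once in the result, which is why no separate core fingerprint is needed. 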

In some cases the extended core is not large enough to assure exact construction of 
the product fingerprint from that of the pieces (i.e. some relevant fragments start in one 
structural variation, span the extended core and reach into the individual alternatives at 
another site). To create and explicitly fingerprint every compound is in fact possible for a 
set of one million products. For the creation of a virtual library with initially tens of 
millions of products and ultimately hundreds of millions and even hundreds of billions of 
product compounds, explicit fingerprint computation is not feasible in any realistic time 
frame. For this scale of virtual library creation an approximation is both acceptable and 
necessary. Finally, since the purpose of the creation of the virtual library is to provide a 
basis for searching for molecules matching some subset criteria, the approximation method 
must ensure that such searches are reliable. 

For the approximation, a random sample of a statistically significant fraction 
(typically for a very large virtual library, 0.001) of the products is taken. Each sample 
product is checked to see how many bits are in the product but not in the fingerprint 
composed from the pieces. The largest observed difference value, MBITS, is maintained 
for future calculations and is used to identify, for example, all products which might be 
similar to a given structure in the extreme case in which all MBITS missing bits were in 
fact those which would make every product most similar. 
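
The sampling procedure for MBITS might be sketched as follows. This is an illustration 
under stated assumptions, not the appendix code: fingerprints are Python integers used as 
bit sets, and the two callables true_fingerprint and estimated_fingerprint, the function 
names, and the toy data are all hypothetical. 

```python
import random


def missing_bits(true_fp, estimated_fp):
    """Count bits set in the explicit fingerprint but absent from the estimate."""
    return bin(true_fp & ~estimated_fp).count("1")


def estimate_mbits(products, true_fingerprint, estimated_fingerprint,
                   fraction=0.001, seed=0):
    """Return the largest missing-bit count seen in a random sample of products."""
    rng = random.Random(seed)
    sample_size = max(1, round(len(products) * fraction))
    sample = rng.sample(products, sample_size)
    return max(missing_bits(true_fingerprint(p), estimated_fingerprint(p))
               for p in sample)


# Toy data (hypothetical): three variations at each of two sites.
site1 = [0b0001, 0b0010, 0b0100]
site2 = [0b1000, 0b0011, 0b0101]
products = [(i, j) for i in range(3) for j in range(3)]


def estimated(p):
    # Estimate composed from the pieces by OR-ing the site fingerprints.
    return site1[p[0]] | site2[p[1]]


def explicit(p):
    # Pretend one product has two fragments spanning both sites.
    spanning = 0b110000 if p == (2, 2) else 0
    return estimated(p) | spanning


# The toy set is tiny, so the whole set is sampled (fraction=1.0).
mbits = estimate_mbits(products, explicit, estimated, fraction=1.0)
print(mbits)  # prints 2
```

For a real library the default fraction of 0.001 follows the figure given in the text; 
the sample only needs explicit fingerprints for the sampled products. 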

The Tanimoto is defined as (#bits in common) / (#bits in either) for the similarity of 
two compounds' fingerprints. In the case at hand, the estimated product fingerprint might 
have as many as MBITS bits which are actually present in the product fingerprint but 
missing from the estimate. In the worst case, every one of those bits would be in common 
with the bits in the query compound's fingerprints. Since Tanimoto = (#bits in common) / 
(#bits in either), in our worst case this is (apparent #bits in common + MBITS) / (#bits in 
either) since every one of the MBITS bits is already represented in the #bits in either but 
is not present in the apparent #bits in common (i.e. the #bits in common based on the 
estimated product fingerprint). 

By adopting this approach, an upper bound is calculated on the largest possible 
Tanimoto between two compounds. The actual product fingerprint cannot yield a higher 
Tanimoto than this, and almost always yields some value between the apparent Tanimoto 
and the upper bound. In some cases this estimates the largest possible Tanimoto to be 
greater than the actual maximum of 1.0; it serves no purpose to correct for this. 

An example may be useful. Details of the computations are provided in the attached 
code dbcslnprepro, but to illustrate the concept assume that what is desired is a subset of 
compounds defined as those with a Tanimoto similarity of 0.80 or higher to a specified 
reference compound. By the methods of this invention the fingerprints of every one of the 
2000 structural variations at two sites (1000 each) have been precomputed. An estimate can 
be made of the fingerprints of every one of the 1,000,000 possible products by OR-ing the 
two sites' fingerprints for every selection of one from each site. For a specific possible 
product the number of common bits is 78 and the "# of bits in either" is 100, so that the 
apparent Tanimoto is 78/100 which is below the cutoff of 0.80 and the product would not 
be selected. However, if the MBITS is 3, then the worst case could have 78+3=81 bits in 
common out of 100 bits in either, and the largest possible Tanimoto would be 81/100 
which is greater than the cutoff. If it is desired to err on the side of not missing any 
possible products, this value would be accepted even though the apparent Tanimoto is too 
small. 
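
The arithmetic of this example can be reproduced with a short sketch. The function 
names are hypothetical; the numbers (78 common bits, 100 bits in either, MBITS of 3, 
cutoff of 0.80) are those of the example above. 

```python
def tanimoto(common_bits, either_bits):
    """Tanimoto = (#bits in common) / (#bits in either)."""
    return common_bits / either_bits


def tanimoto_upper_bound(common_bits, either_bits, mbits):
    """Worst case: every one of the MBITS missing bits is in common with the query."""
    return (common_bits + mbits) / either_bits


cutoff = 0.80
apparent = tanimoto(78, 100)               # 0.78, below the cutoff
bound = tanimoto_upper_bound(78, 100, 3)   # 0.81, above the cutoff

# Erring on the side of not missing products: accept on the upper bound.
keep = bound >= cutoff
print(apparent, bound, keep)  # prints 0.78 0.81 True
```

The bound is not clipped at 1.0, in keeping with the remark that correcting for values 
above 1.0 serves no purpose. 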

The results of the fingerprint calculations discussed above are added as two 
additional fields to the structural variation files: fpcard and fp, which together represent the 
two-dimensional fingerprint of the structural alternative and everything to which it is 
connected in all of the resulting products; this additional structure being needed to more 
fully represent the fingerprint of a product compound by that of the structural variations 
which combine to form it. At the minimum, the common structural portion by which the 
alternative's structure is augmented is that of the core. Appendix G contains the code 
dbcslnprepro which calculates and adds fpcard and fp. 

When the fingerprint terms, fpcard and fp, are added to the file structure shown in 
Table 7, the complete file format for each structural variation follows the form: 

TABLE 8 

[The entries of Table 8 are largely illegible in the source. Each record has the form of a 
Table 7 record (an SLN structure followed by fields such as SUPPLIER=ALDRICH, MW, LOGP 
and CTOPS) extended with the two fields fpcard (for example fpcard=141) and fp (an 
encoded bit string).] 

When initially constructed the virtual library consisted of the files described above. 
However, since the fingerprint metric is calculated for each set of structural variations 
attached to a specific core, separate structural variation files containing the fingerprint data 
were required for each combination of core with the structural variations. The virtual 
library therefore contained a great deal of redundant data (structural variation files 
repetitively containing the same non-fingerprint data). Accordingly, a more efficient virtual 
library is constructed by locating the fingerprint files associated with each structural 
variation file and different cores in separate files. Thus, only one copy of each structural 
variation file (like Table 7) is required, and there is an associated fingerprint file containing 
fpcard and fp for every core with which the structural variation file is used. The virtual 
library keeps track of all the individual files in a master file. For instance, on one line of 
the master file is kept the information that the Table 6 file is associated with its appropriate 
structural variation files and fingerprint files. Each line of the master file relates one Table 
6 like file (cSLN file) with the appropriate structural variation files and fingerprint files. 
The same structural variation files may now be used with more than one cSLN as long as 
the same type of chemical reaction is involved. Appendix G contains the code dbcslnprepro 
which calculates fpcard and fp, writes the fingerprint files, and updates the master file. 
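
The relationship recorded by the master file can be pictured with a small data-structure 
sketch. The actual master file layout is not given in this text, so the record type, field 
names and file names below are all hypothetical; the sketch shows only the point that 
structural variation files are shared between cores while fingerprint files are per core. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MasterEntry:
    """One line of the master file (hypothetical layout)."""
    csln_file: str            # Table 6 like file describing one core
    variation_files: tuple    # one structural variation file per site
    fingerprint_files: tuple  # fpcard/fp data for this core, one per site


master = [
    MasterEntry("core_a.csln",
                ("amines_site1.sv", "amines_site2.sv"),
                ("core_a_site1.fp", "core_a_site2.fp")),
    MasterEntry("core_b.csln",
                ("amines_site1.sv", "amines_site2.sv"),
                ("core_b_site1.fp", "core_b_site2.fp")),
]

# The same variation files serve both cores; only the fingerprints differ.
distinct_variation_files = {f for e in master for f in e.variation_files}
distinct_fingerprint_files = {f for e in master for f in e.fingerprint_files}
print(len(distinct_variation_files), len(distinct_fingerprint_files))  # prints 2 4
```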

Clearly the data associated with each structural variation in each file can be directly 
expanded to include the results of the application of any other validated metric to the 
structural variation. 

Summary of Method and Scope of Chemistry 
Creation of a virtual library of structural variation files along with one definition file 
is all that is needed to describe all the products of a combinatorial synthesis; that is, all 
possible products of the combinatorial synthesis are now described using only descriptors of 
the structural variations. As many additional combinatorial syntheses may be added to the 
virtual library as are desired. Clearly, the larger the number, the more comprehensive will 
be the universe of accessible compounds which can be searched. In this manner the N1 x N2 
x N3 x ... number of products may be analyzed using only the N1 + N2 + N3 + ... number 
of structural variations. This ability to search a geometrically large number of product 
structures by searching through only the arithmetic sum of their parts is the key feature of 
the virtual library and is possible because of the identification and use of validated 
descriptors possessing the neighborhood property. Clearly, this same method is equally 
applicable to any large assembly of compounds not derived from a combinatorial synthetic 
scheme which can be described as combinations of structural variations. Any number of 
additional fields containing information about the structural variation may be added to 
the file format, and may be meaningfully used as part of the search criteria for subset 
selection. 

There is special merit in assuring that each product which a user may select from 
this database (virtual library) corresponds to a known synthetic route and known available 
reagents. However, the routines which the user applies to select subsets of the virtual 
library, described below, do not depend on this. Neither does the representation itself 
inherently depend on the assumption of a known synthesis pathway. Therefore it can be 
applied to any situation in which the set of compounds of interest can be expressed 
concisely as a core and points of enumerated structural variations. This makes the scope of 
the method, in principle, cover virtually all of small molecule chemistry. In the limit, any 
molecule is divisible into such a representation where there may be only one "structural 
variation" known in each list. In fact, the practical advantages of the invention will only 
obtain when the number of structural variations is large. 
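
The contrast between the product N1 x N2 x ... and the sum N1 + N2 + ... can be made 
concrete with a short calculation. The helper name is hypothetical; the site sizes are 
those used in the examples of this section (1,000 or 15,000 variations at each of two 
sites). 

```python
from math import prod


def products_and_records(site_sizes):
    """Return (number of products described, number of variation records stored)."""
    return prod(site_sizes), sum(site_sizes)


print(products_and_records([1000, 1000]))    # prints (1000000, 2000)
print(products_and_records([15000, 15000]))  # prints (225000000, 30000)
print(products_and_records([1, 1]))          # prints (1, 2)
```

The last line is the limiting case mentioned above: with a single variation in each list 
the representation stores more records than it describes products, so the advantage 
appears only when the lists are large. 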
F. Searching the Virtual Library 

The techniques of constructing and searching the virtual library present the 
molecular researcher with powerful methods of discovery not previously possible and 
represent another major advance in the state of the art. Since the virtual library is 
constructed for purposes of finding molecular similarities in structure and function, a 
unique feature of the virtual library is that you can ask questions of similarity in two 
fundamental ways - providing, essentially, two sides of the same coin. The first way is in 
the design of screening libraries - subsets of the virtual library where what is sought are all 
those product molecules meeting some set of similarity criteria and not their structurally 
and/or functionally equivalent neighbors (as illustrated in Figure 1B). The second way is in 
expanding on a lead compound (lead explosion) - subsets of the virtual library where what 
is sought are all those product molecules meeting some set of similarity criteria to the lead 
and all the structurally and/or functionally equivalent neighbors. Clearly, as a given line of 
inquiry is followed, the search for the desired subsets may, at any given level of detail, 
take on aspects of one or the other of these two methods of inquiry. For instance, a search 
for all product molecules matching a lead compound may result in 10 million possibilities. 
In order to make the synthesis and actual screening more efficient, out of these 10 million, 
a screening library may be selected which does not sample the same neighborhood space 
more than once. This ability to perform different types of similarity searches underlies the 
discussion which follows. 

Any of the characteristics associated in the virtual library file with each structural 
variation may be searched separately or in conjunction with other characteristics. Since 
validated metrics are used as descriptors for each structural variation, it is possible using 
only the data contained in the structural variation files to quickly identify those product 
molecules which could be formed from the structural variations similar in structure and 
biological activity to known molecules (such as lead compounds) or arbitrarily chosen 
molecules (screening libraries). With the virtual library, a structural search can be carried 
out without having to actually generate and compare any explicit structures of any possible 
product molecules. Subset libraries (screening libraries) representing molecules with 
selected characteristics can thereby be directly created by a search of the virtual library, 
and product structures created and generated only for those molecules included in the 
subset library. It is important to understand that the virtual library can be formed from any 
number of combinatorial synthetic schemes or can include molecules which, while not 
based on a combinatorial synthetic scheme, may be expressed in the form of a cSLN. 
Methods of including and searching such molecules will be discussed below. Not only does 
the discovery of a way to create the virtual library make it possible to search an 
extraordinarily large number of possible molecular structures, but it also makes it possible 
to do the searching in an extremely efficient manner and in a very short period of time. 

Since a variety of data associated with each structural variation, including that 
resulting from the application of validated metrics, is stored in the virtual library, the range 
of questions (searches) and the types of answers (subset libraries) one can ask of and 
receive from the virtual library is virtually unlimited and the number of possible product 
molecules examined to answer the questions is extraordinarily large. As emphasized earlier, 
the virtual library associates precomputed metric values with each structural variation. 
Library searching is based on the discovery that the metric characteristics of product 
molecules can be usefully estimated by the metric values of the structural variations used to 
form the products. As has been seen above, in the case of the Tanimoto fingerprint, it was 
also necessary to take into consideration in preparing the precomputed metric values some 
estimation of the core structure. For topomeric field searching, a useful method of 
comparison involves taking the root mean sum of squares differences between the metric 
field values of one structural variation and another. This value can then be compared to a 
chosen neighborhood distance to determine similarity. Finally, it should be recognized that 
in discussing core structures used in combinatorial arrangements, for purposes of creating 
and searching the virtual library, it is possible to consider a single bond as a core structure. 
In such a case, the structural variations would be combinatorially combined across a single 
bond. 

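The topomeric field comparison mentioned above can be sketched as follows. The sketch 
reads "root mean sum of squares differences" as a root mean square over corresponding 
field values; that reading, the function names, and the four-point toy fields are 
assumptions. If the plain root sum of squares is intended, the division by the number of 
points would be dropped. 

```python
from math import sqrt


def field_distance(field_a, field_b):
    """Root mean square difference between two topomeric field vectors."""
    if len(field_a) != len(field_b):
        raise ValueError("fields must be sampled at the same lattice points")
    squared = [(a - b) ** 2 for a, b in zip(field_a, field_b)]
    return sqrt(sum(squared) / len(squared))


def is_neighbor(field_a, field_b, neighborhood_distance):
    """True when the two variations fall within the chosen neighborhood distance."""
    return field_distance(field_a, field_b) <= neighborhood_distance


# Hypothetical four-point fields.
a = [0.0, 3.0, 4.0, 0.0]
b = [0.0, 0.0, 0.0, 0.0]
print(field_distance(a, b))    # prints 2.5
print(is_neighbor(a, b, 3.0))  # prints True
```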


As presently implemented by the inventors, the virtual library has to date 170 billion 
possible product compounds representing 70,000 combinatorial reaction schemes over 
various cores, and it is being expanded monthly. The sheer size of the virtual library 
suggests that search times must be similarly enormous. However, using the search 
methodology described below, made possible by the construction of the virtual library 
based on validated metrics, real world searching rates of greater than 200 - 500 million 
compounds per hour have been routinely achieved with a single processor. Higher rates are 
achievable on a parallel processing computer with multiple processors such as are now 
available from several vendors including Silicon Graphics, Inc. 

i. Example Search Routine of Virtual Library - Tanimoto Similarity 
A brief overview of a typical search utilizing 2D fingerprints (a validated metric) 
will highlight the general approach used for all searches of the virtual library, which at 
their most fundamental level, rely on the values of the neighborhood distances found for 
the validated metrics. The overall process of using the Tanimoto fingerprint metric to 
search for molecules is summarized in the flowchart of Figures 21, 22, and 23. A typical 
library based on the combinatorial synthetic scheme utilizing a reactive diamino core will 
be used again as an example. As noted, this synthetic scheme alone contributes 
approximately 112 billion compounds to the virtual library data base. The question typically 
presented will ask whether the virtual library contains any molecules having a structure 
likely to yield a biological activity close to that of some known compound. To complete the 
search nothing need be known about the actual chemical compound for which close 
structures are desired, provided a 2D fingerprint for the molecule is supplied. Of course, 
generally the molecular structure of the known molecule is provided and the software 
calculates the 2D fingerprint. A particularly important consideration is that the known 
molecule need not have resulted from a combinatorial synthesis and can, in fact, have any 
possible structure. The searching method of this invention independently searches each set 
of associated files generated by the virtual library construction method of the invention; in 
the case of the diamino example, a set of three files as outlined earlier. The reason each 
must be searched independently is that the searching program utilizes a knowledge of the 
number of sites (at which structural variations occurred in the synthetic scheme) to analyze 
the closeness of structure to the test molecule. 

Based on knowledge of the neighborhood property of the validated Tanimoto metric, 
any molecule falling within a neighborhood Tanimoto similarity of 0.85 of another 
molecule should possess similar structural and biological characteristics. For this example, 
a Tanimoto similarity of 0.85 provides the basic selection criteria for examining the virtual 
library data base. Continuing with the example above, the fingerprint of the known 
molecule would first be compared to the fingerprint contained in every structural variation 
occurring at each of the two sites (2 x 15,000). The method determines how many of the 
bits set by the known molecule would be set by each structural variation. For all 15,000 
choices at varying site R1 (all 15,000 structural variations at R1) the method compares the 
known molecule's fingerprint to the structural variation fingerprint. The same is then done 
for all 15,000 structural variations at site R2. Then, for each one of the 15,000 choices at 
varying site R1 the number of the matching bits set by that structural variation is added to 
the number of the matching bits for each one of the structural variations at R2. For the 
entire set of structural variations at R1 and R2, this involves only the integer addition of 
15,000 x 15,000 terms and may be typically accomplished within fractions of a minute. 

As each addition is completed, the resulting sum is compared to the Tanimoto 
neighborhood criteria. Suppose 100 bits were set by the known molecule. If the sum of bits 
totaled 65 and the neighborhood Tanimoto criteria of 0.85 (85 out of 100) were used, it 
would not be possible for any combination of those structural variations to form a molecule 
which would closely match the structure of the known molecule. 
As noted above, the method also provides a check (MBITS) on the approximation 
routine used to calculate the fingerprints of the product molecules which would be formed 
from the two structural variations at sites R1 and R2. In this example, a typical MBITS 
value of 4 is assumed. Adding the 4 MBITS to the 65 only yields 69 which is clearly not 
within the required degree of Tanimoto neighborhood. However, had the bits from the 
structural variations added to 82, then the addition of the MBITS 4 would yield a total of 
86, and the molecule formed from those structural variations would be considered close 
enough to check further. To confirm a match, the fingerprints from the two structural 
variations involved are OR-ed (Boolean) so that commonly set bits are counted only once 
and then compared to the fingerprint of the known molecule. Only if the resulting number 
when added to the MBITS term is greater than 85, is the product molecule represented by 
the two variations considered a match and included in a subset library resulting from the 
search. While these additional calculations take extra time, it is only necessary to perform 
them on structural variation combinations which pass the first level of screening (set bits > 
85). Therefore, typically only thousands of extra additions need to be calculated instead of 
millions, and the method is very fast. By the method of this invention hundreds of millions 
of possible compounds may be searched within a couple hours of computer time. 
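
The two-level screen described above can be sketched for a two-site file set. This is an 
illustration only, not the appendix code: fingerprints are Python integers used as bit 
sets, the function and parameter names are hypothetical, and the criterion follows the 
simplified bit-count form used in this example (matching bits plus MBITS must exceed a 
required count, such as 85 out of 100). 

```python
def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def search_two_sites(query_fp, site1_fps, site2_fps, required_bits, mbits):
    """Return (i, j) index pairs whose product may match the query fingerprint."""
    # Matching-bit counts are computed once per structural variation.
    m1 = [popcount(query_fp & fp) for fp in site1_fps]
    m2 = [popcount(query_fp & fp) for fp in site2_fps]
    hits = []
    for i, c1 in enumerate(m1):
        for j, c2 in enumerate(m2):
            # First level: integer addition only. The sum can only overcount
            # shared bits, so it never rejects a pair that would pass below.
            if c1 + c2 + mbits <= required_bits:
                continue
            # Second level: OR the fingerprints so shared bits count once.
            combined = site1_fps[i] | site2_fps[j]
            if popcount(query_fp & combined) + mbits > required_bits:
                hits.append((i, j))
    return hits


# Hypothetical 8-bit toy data.
query = 0b11111111
site1 = [0b00001111, 0b00000011]
site2 = [0b11110000, 0b00001110, 0b00000001]
hits = search_two_sites(query, site1, site2, required_bits=6, mbits=1)
print(hits)  # prints [(0, 0), (1, 0)]
```

In the toy data the pair (0, 1) passes the first level (4 + 3 + 1 = 8) but fails after 
OR-ing, because the two variations set overlapping bits; this is the case the second 
level exists to catch. 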

This testing procedure is continued through every set of structural variation virtual 
library files. Different sets of files resulting from other two site synthetic schemes would 
be checked in a similar fashion. When the known molecule was tested against a file set 
constructed from a synthetic scheme having three sites at which a structural variation could 
occur, the sum of the matching fingerprints contributed from three structural variations 
would be used and tested against the fingerprint of the known molecule in an identical 
manner. The actual method embodied in the software performs many quick checks on each 
set of structural variation files and quickly ascertains whether that set of files could yield a 
product structure with the required structural characteristics (fingerprint in this example). If 
the quick check indicates that the set of files could not yield the known molecule, the 
search is quickly advanced to the next set of files. In fact, on a parallel processing 
machine, many simultaneous searches are performed. Thus, the time to search the entire 
virtual library is relatively short. 

Several points are extremely important. First, the characteristic of the known 
molecule is checked against only files associated with the structural variations. Thus, a set 
of associated files containing 2,000 structural variations (where 1,000 structural variations 
may occur at each of two sites) requires the examination of only 2,000 structural variations 
to accomplish a search of 1,000,000 (1,000 x 1,000) possible product molecules. Second, 
during the search only the structural variations which would contribute to a molecule 
having the desired structural characteristics are identified. Only after all such structural 
variations are identified, are the actual product molecules assembled from the structural 
variations and their entire structure specified for inclusion in the desired subset. Third, it 
does not matter whether the known molecule could be synthesized by a known 
combinatorial scheme. The information derived from a search such as in the example 
would identify those molecules which could be derived from a combinatorial scheme which 
most likely have the same structural and biological characteristics as the known molecule. 
However, in creating the virtual library, all that is required is that the compounds can be 
described by a cSLN. The searching method of this invention could equally well find one 
or more of these molecules not derived from a combinatorial synthetic scheme as being 
likely to have the same structural and biological characteristics as the known molecule. The 
only difference in this latter case is that no information about a possible synthetic route is 
available from the results of the search. 

Clearly the greater the number of compounds specified in the entire virtual library 
data base, whether based on known combinatorial synthetic schemes or resulting from other 
synthetic pathways and expressed as a cSLN, the greater the likelihood of finding 
molecules with similar structural and biological characteristics. Fourth, such structural 
searches require the use of validated metrics exhibiting a neighborhood property to 
characterize both the structural variations and the known molecule. Fifth, once the virtual 
library data base is constructed based on the method of this invention, there are any 
number of different types of searches which can be run. The software code provided with 
this application permits many such searches as outlined in the descriptions of the code 
below. 

ii. Designing Screening Libraries (Subsets of the Virtual Library) 
In the current invention, one single method is used to select among all possible 
products from one or more reactions which share a common core substructure. A bitset is 
used to represent all the possible products (generally in the tens of millions). One may 
choose to limit the design subset selection to those compounds which are made of reagents 
from a specified subset of suppliers, to those of suitable price, to those of suitable 
molecular weight, logP, etc. One may seed the design with a set of preselected products. 
One may remove all products in the neighborhood of a subset of compounds as a preface to 
the design run. 

The design process, once all the above initial subset operations have been 
performed, is extremely simple: 

• select a compound to add to the design, and remove its neighbors from 
further consideration 

• continue until no other compounds are left 

The selection may be random, or may be directed to maximize use of a reagent once 
selected (this matches the practical requirements for a laboratory two-step synthesis in 
which maximum use of the first step's intermediate structures offers a substantial advantage 
in speed and cost). In principle, any rule can be invoked to prioritize which compound to 
select next, since any remaining compound is allowable at every step. Examples of this 
type of search are given below. 
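
The design loop described above (pick a compound, remove its neighbors, repeat) can be 
sketched with a bitset. This is a minimal illustration, not the appendix code: the bitset 
is a Python integer with one bit per product, the neighbor_mask callable stands in for 
whatever validated-metric neighborhood test is used, and the selection rule (lowest 
remaining index) is just one of the allowable rules. All names are hypothetical. 

```python
def design_library(n_products, neighbor_mask, seeds=()):
    """Greedy design: select a product, then drop it and its neighbors."""
    remaining = (1 << n_products) - 1  # one bit per candidate product
    design = []
    for s in seeds:                    # preselected products, if any
        design.append(s)
        remaining &= ~(neighbor_mask(s) | (1 << s))
    while remaining:
        # Selection rule of this sketch: lowest-numbered remaining product.
        pick = (remaining & -remaining).bit_length() - 1
        design.append(pick)
        remaining &= ~(neighbor_mask(pick) | (1 << pick))
    return design


def line_neighbors(i, n=7):
    """Toy neighborhood: products i-1 and i+1 are neighbors of product i."""
    mask = 0
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            mask |= 1 << j
    return mask


print(design_library(7, line_neighbors))             # prints [0, 2, 4, 6]
print(design_library(7, line_neighbors, seeds=[3]))  # prints [3, 0, 5]
```

Replacing the lowest-index rule with a random choice, or with a rule that prefers 
products sharing an already selected reagent, changes only the line that computes pick. 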
(a) Subset Screening Library Based On Topomeric Fields and Tanimoto 

A selection of a screening library based on the same criteria as were discussed in 
the first part of this application is easily implemented using the virtual library. The library 
members are identified based on topomers (is the distance too small in topomer space) and 
on Tanimoto similarity separately, as was done in the earlier disclosed method. However, 
every reagent is always allowed, unlike the earlier method in which only a small subset of 
reagents made it through the reagent filter to the product stage. The earlier methods 
selected products based on maximal dissimilarity of product Tanimoto at each selection. 
Since by using the virtual library only the final selection set (all possible combinatorially 
created molecules meeting the selection criteria) is used, and does not depend upon or rely 
upon the ordering within a selected set (of reagents), the virtual library method is more 
flexible and in practice faster than the earlier disclosed method. In fact, since the product 
selection is not constrained by reagent stage selection, somewhat larger screening libraries 
result from using the virtual library. The overall process of using both the topomeric 
CoMFA and Tanimoto metrics to search for molecules in the virtual library is summarized 
in the flowchart of Figures 24, 25, and 26. Code to implement this search, db_des, is 
contained in Appendix K. A more extensive description of the code may be found in 
section G which follows. 
(b) Subset Based on Tanimoto Similarity 

A subset of the virtual library chosen just based on Tanimoto similarity/dissimilarity 
of product molecules, which could be created meeting some initial selection criteria, can be 
directly chosen. Code to implement this search, dbcslqs, is contained in Appendix I. A 
more extensive description of the code may be found in section G which follows. 

(c) Subset Based on Topomeric Fields 

A subset of the virtual library chosen just based on topomeric CoMFA field 
similarity/dissimilarity of product molecules, which could be created meeting some initial 
selection criteria, can be directly chosen. Code to implement this search, db_qstop, is 
contained in Appendix J. A more extensive description of the code may be found in section 
G which follows. 

(d) Subset Based on Combined Metric 

A subset of the virtual library may be based as well upon the combined topomeric- 
fingerprint metric described earlier. Code to implement this search, db_both, is contained 
in Appendix L. A more extensive description of the code may be found in section G which 
follows. 

iii. Designing Optimizations 
The various techniques of lead optimization to explore the island of activity were 
discussed earlier. The same techniques used with the virtual library are much more 
powerful since a vastly larger chemical universe is being investigated. Generally, any 
property associated with a structural variation in the virtual library can be used to expand 
and define the product molecules sought. 

Subsets of molecules from the virtual library database may be selected based on
descriptors typically including, but not limited to, the following:
• reagent identifier
• reagent supplier
• reagent or product molecular weight
• reagent or product price
• reagent or product estimated logP
• reagent shape contribution; product shape contribution under certain restrictions
• reagent or product 2D fingerprint
• product substructural features
Subsets may be selected by methods including, but not limited to, applying simple filters,
requiring a specific degree of similarity to reference compounds, or applying proprietary
design tools.
Specifically, the initial modes of subset selection may include:

• substructural searching, to identify compounds which have a set of required
structural features, is perhaps the most often used method of chemical database
subset selection

• 3D feature searching, to add interatomic distance requirements to the
substructural searching, is also familiar to experts in chemical database
searching

• similarity searching, to find subsets which are substantially like a reference
compound, is widely used as well and corresponds to application of a
neighborhood principle applied to 2D fingerprints or (in planned extensions) atom
pair distance fingerprints, etc.

• scalar searches corresponding to traditional nonstructural database queries, to
find compounds with, for example, logP between 5 and 8 and molecular weight
under 500 and price above 750

• maximum dissimilarity queries, which are used primarily to order a large subset
of compounds such that as one reads down the ordered list, compounds are less
distinct from each other as a group

• STIGMATA (a procedure popularized by the scientists at Parke Davis) queries,
in which compounds are selected based on the presence of specific bits in a
fingerprint (2D, atom pair, pharmacophore triplets, etc.). Commonly such a
query is derived by reference to a set of desirable compounds, from which the
bits present in all compounds in the set are derived.

• design queries (scalar, topomer, fingerprint, arbitrary weightings of any of
these) of either of two types:
  - gridding methods, in which the objective is to have one compound
    within each specific "hypercube" of the design space
  - neighborhood methods, in which the objective is to obtain a set in
    which no two compounds are overly similar, and in which no "holes"
    exist needlessly

(a) Search Based on Tanimoto Similarity
Details of a typical lead optimization using the Tanimoto metric were highlighted under
section 12(E)(i) above. Essentially, what is sought is a list of all compounds to be found within
the Tanimoto neighborhood of the lead. Code to implement this type of search, db_sim, is
contained in Appendix H. A more extensive description of the code may be found in section
G which follows.
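
A Tanimoto neighborhood search of this kind can be sketched as follows. This is only an illustration in Python, not the code of Appendix H: fingerprints are represented as sets of "on" bit positions, and the function names, the example fingerprints, and the 0.85 cutoff are all assumptions made for the sketch.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def tanimoto_neighbors(lead, library, cutoff=0.85):
    """Return (name, similarity) for every library entry at or above the cutoff,
    most similar first. The cutoff value is an assumed example."""
    hits = [(name, tanimoto(lead, fp)) for name, fp in library.items()]
    return sorted((h for h in hits if h[1] >= cutoff), key=lambda h: -h[1])


# Hypothetical fingerprints, for illustration only.
lead = frozenset({1, 4, 7, 9, 12, 15})
library = {
    "cmpd_a": frozenset({1, 4, 7, 9, 12, 15}),      # identical bits
    "cmpd_b": frozenset({1, 4, 7, 9, 12, 15, 20}),  # one extra bit: 6/7
    "cmpd_c": frozenset({2, 3, 5}),                 # nothing in common
}
neighbors = tanimoto_neighbors(lead, library, cutoff=0.85)
```

Here `neighbors` lists the compounds that fall inside the neighborhood of the lead, which is the list the search is meant to return.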

(b) Searches Based on Topomer Similarity
The notion of topomer similarity of a pair of molecules is well defined if the molecules
have some common "core". An enhanced method has also been discovered which allows
arbitrary structures as search queries, not just those which result from a combinatorial
synthesis. Therefore, to find molecules similar to some target within the virtual library, the
following three phase operation, as summarized in the flowchart of Figures 27, 28, 29, and 30,
must be performed:

1) Determine which of the "common core" substructures (where the core may
consist of a single bond and any single bond is equivalent to any other single
bond for topomer searching) within the virtual library are wholly contained
within the search target molecule. This can be done by any standard searching
program, such as Tripos' Unity package.

2) For each of the common cores found, remove that common core from the
search target. The atoms remaining will comprise one or more side chains.
Generate the topomeric conformations of each of the side chains, using the same
code that is used to build topomeric conformations during library ("all possible
products") generation. Generate the topomeric conformation of the core.

3) Using these topomeric conformations of each of the target molecule's side
chains, search the combinatorial libraries corresponding to the previously
identified common cores for all side chains whose sum of corresponding side
chain topomeric differences is less than the neighborhood radius within the
typical neighborhood range of 80 - 100 kcal/mol (91 kcal/mol). Alternatively,
the root sum of square differences between the fields may be used to determine
the selection criteria. The procedure is shown in the flowsheet of Figures 27,
28, 29 and 30 and described below.
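
The acceptance test of the third phase can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the code of the Appendices: each side chain field is a plain list of grid values, the per-chain difference is taken as the root sum of squared grid differences, and the two ways of combining the per-chain differences (simple sum, or root sum of squares) are one reading of the alternatives described above. The function names and the example field values are hypothetical.

```python
import math


def field_difference(field_a, field_b):
    """Root sum of squared differences between two fields on the same grid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(field_a, field_b)))


def within_neighborhood(query_chains, product_chains, radius=91.0, combine="sum"):
    """Compare corresponding side chains and test the total against the radius.

    combine="sum": add the per-chain differences.
    combine="rss": take the root sum of squares of the per-chain differences.
    Returns (accepted, total_difference).
    """
    diffs = [field_difference(q, p) for q, p in zip(query_chains, product_chains)]
    if combine == "sum":
        total = sum(diffs)
    else:
        total = math.sqrt(sum(d * d for d in diffs))
    return total < radius, total


# Two side chains each, with made-up field values.
query = [[0.0, 30.0, 0.0], [10.0, 0.0]]
product = [[0.0, 0.0, 0.0], [10.0, 40.0]]
accepted_sum, total_sum = within_neighborhood(query, product)
accepted_rss, total_rss = within_neighborhood(query, product, combine="rss")
```

With these values the per-chain differences are 30 and 40, so the summed difference is 70 and the root sum of squares is 50; both fall inside a radius of 91.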

(c) Topomeric 3D Searching of Arbitrary Molecular Structures
In addition to searching the virtual library as outlined above, it is possible to conduct
searches which were heretofore impossible by any means. In particular, a critical question
which frequently occurs in chemical research, and especially in biological research, can now
be addressed by the discovery and creation embodied in the virtual library. The problem, as
it is usually presented, takes the form: given an arbitrary query molecule (generally one
previously found to exhibit a desired activity), find biologically similar molecules, that is,
molecules of similar 3D shape and activity, that can readily be made and tested. Generally,
such a query molecule will not have resulted from a combinatorial synthesis, and, in fact, no
knowledge of a possible synthetic route to the molecule may be available. As an example,
suppose that compounds similar in 3D shape to but structurally different from the structure
(written in SLN) CH3C(=O)NHCH(CH3)CH2NHCH2CH2OH are desirable, perhaps because
this hypothetical structure was reported by a pharmaceutical competitor to be highly active.

As described earlier, the topomeric 3D shape data within the Virtual Libraries actually
describe fragments (structural variations) of molecules. To find similarly shaped molecules
within the virtual library, the query molecule must be fragmented and the shapes of its
fragments compared with the shapes of corresponding fragments (structural variations) in the
virtual library. The difficulty is that a query molecule can be fragmented in so very many ways
for searching against the virtual library containing in excess of 10^12 molecules. (The example
given has nine bonds connecting heavy atoms, so there are nine two-fragment combinations
that could be considered, 9 x 8 = 72 three-fragment combinations, 9 x 8 x 7 = 504
four-fragment combinations, etc.) Given this situation, what is needed is a way to emphasize
those fragmentations that are most likely to conform to efficient synthetic routes from available
starting materials, without requiring the searcher of the virtual library to have any knowledge
of what synthetic routes it includes.
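
The growth in the number of candidate fragmentations can be checked with a few lines of Python. The figures quoted above (9, 72, 504) count ordered choices of bonds to break. The sketch below reproduces them and, for comparison, also gives the number of unordered choices, which is smaller but grows in the same way. It assumes, as the example does, nine breakable acyclic bonds, so that each additional cut yields one additional fragment.

```python
from math import comb, perm

BREAKABLE_BONDS = 9  # taken from the example query molecule

# Map "number of fragments" -> (ordered choices, unordered choices) of bonds to cut.
counts = {}
for cuts in (1, 2, 3):
    fragments = cuts + 1
    counts[fragments] = (perm(BREAKABLE_BONDS, cuts), comb(BREAKABLE_BONDS, cuts))

for fragments, (ordered, unordered) in counts.items():
    print(f"{fragments} fragments: {ordered} ordered, {unordered} unordered")
```

Either way of counting shows that exhaustive fragmentation quickly becomes impractical for a library of this size, which is the motivation for the rule-based approach that follows.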

The solution to this problem which can be uniquely achieved with the virtual library 
is a "fragmentation table", where each row constitutes a rule of the following sort: "for each 
occurrence of this particular structural feature combination (structural variation) in the query 
molecule, decompose the query molecule in a particular way specified in terms of this 
structural feature, and search only those combinatorial libraries that utilize specified reactions 
(sequences) and/or building blocks, mapping specified query fragments onto specified classes 
of building blocks". Each such query decomposition found generates a search of the virtual 
library, returning all those products whose sum of squares of differences in shape between 
corresponding product and query fragments is less than a user specified neighborhood distance 
threshold. Passing the query molecule (by means of a suitable computer program) against all 
the rows of this table generates all searches. 

To illustrate this approach with a simple example, one row in the table might have as
its structural feature C(=O)-[!r]NH (amide bond, where [!r] states that the preceding bond
must not be cyclic). This row would specify cleavage between the N and C of any matching
fragment within the query, for our example query yielding the fragments CH3C(=O)- and
-NHCH(CH3)CH2NHCH2CH2OH, and the characteristics that a matching subset library
should have (primary or secondary amine reacting with an acid, acid chloride, isocyanate,
chloroformate). The similarity searching engine then returns all products in the virtual library
formed from amines close enough in shape to -NHCH(CH3)CH2NHCH2CH2OH and acylating
reagents close enough in shape to CH3C(=O)-.

Note that the amide bond is a synthetic convenience, not an absolute arbiter of shape
similarity. Molecules in which the amide bond is "reversed" might also be sufficiently shape
similar overall to have biological similarity to the query molecule, despite the local differences
in shape resulting from the NH to C=O mismatch. Indeed, any reaction that forms a single
acyclic bond might contain bioisosteres of our query molecule within its virtual library. On the
other hand, an amide library would contain both the most accessible and also the largest
number of bioisosteres and so this is the library that should first be searched.

Another row in the fragmentation table might designate a query decomposition into
three fragments, with a structural feature R-[!r]NXN-[!r]R. Application of this row to our
query molecule would generate CH3C(=O)-, -NHCH(CH3)CH2NH-, and -CH2CH2OH.
When searching the "diamine" library (about 10^11 structures) for similarity using these
fragments, the "core" or diamine component is searched first for fragments similar in shape
to -NHCH(CH3)CH2NH- (see below for a description of the special features of core shape
similarity). Core shape similarity is much rarer than side-chain shape similarity and so an
efficient search process considers core similarity before considering side chain similarity.

An example of what a few rows in a typical fragmentation table look like is shown
below. The descriptions of the individual named columns are as follows:

CLASS_ID = equivalent in meaning and value to CLASS_ID in the REACTIONS
table. Identifies a particular reaction sequence as it would be carried out in the laboratory.
Only those virtual library records whose CLASS_ID matches this value will actually be
searched.

PRIORITY = Allows a searcher to control the depth of a search. Lower values
correspond to reactions which are less general, but whose products are more likely to resemble
a matching query. Deeper searches will also consider rows having higher values of
PRIORITY.

SLN = the structural pattern that will be matched within the query molecule. Each
match found within the query molecule generates a decomposition of the query into fragments
for topomeric similarity searching, as detailed elsewhere.

REACTANTS = Allows the developer of this table to limit application of a particular
row to reactions involving particular classes of reactants.

ATOMS = Specifies, by reference to the fragment description within the SLN column,
the bonds in the query whose breaking will generate the fragments to be used in topomeric
field similarity searching.

The three rows shown illustrate the three examples discussed elsewhere in this
description: Row 1 - diamine derivatization; Row 3 - amide formation; Row 7 - thioether
cleavage. For clarity the information for these rows is broken into three sections:

        1           2           3
        CLASS_ID    PRIORITY    SLN
ROW1    5           2.00        Hev-[!R]NXN-[!R]Hev
ROW3    6           2.00        HevHev(=O)-[!R]NHev
ROW7    22          2.00        CS-[!R]HevHev

        4
        REACTANTS
ROW1    X1=RN=C=O,ClC(=O)OR,Epoxide,Ald/Ket,RC(=O)Cl,
        RCOOH,RCOO[-],RSO2Cl,ArF(activated),N:CHal,C-CC
        X2=RN=C=O,ClC(=O)OR,Epoxide,Ald/Ket,RC(=O)Cl,
        RCOOH,RCOO[-],RSO2Cl,ArF(activated),N:CHal,C-CCA,H
ROW3    X1=Amine(~3) X2=RCOOH,RN=C=O,ClC(=O)OR,RC(=O)Cl,
        RCOO[-],RSO2Cl,ArF(activated),N:CHal,C-CCX
ROW7    X1=RSH X2=RN=C=S,RN=C=O,RSO2Cl,RCl,ArF(activated),
        N:CHal,RBr

        5
        ATOMS
ROW1    1,2 5,4
ROW3    4,2 2,4
ROW7    2,3 3,2
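
One way to picture how such a table drives a search is the following Python sketch. It is illustrative only and is not the code of Appendix T: the class name, the function names, and the reading of the ATOMS column as pairs of pattern atom numbers are assumptions. The CLASS_ID, PRIORITY, SLN, and ATOMS values are those of the three rows shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FragmentationRow:
    row: int          # row number in the fragmentation table
    class_id: int     # reaction sequence (CLASS_ID) to be searched
    priority: float   # lower values are searched first
    sln: str          # pattern matched within the query molecule
    atoms: tuple      # assumed: pairs of pattern atoms marking bonds to break


TABLE = [
    FragmentationRow(1, 5, 2.00, "Hev-[!R]NXN-[!R]Hev", ((1, 2), (5, 4))),
    FragmentationRow(3, 6, 2.00, "HevHev(=O)-[!R]NHev", ((4, 2), (2, 4))),
    FragmentationRow(7, 22, 2.00, "CS-[!R]HevHev", ((2, 3), (3, 2))),
]


def rows_for_depth(table, max_priority):
    """Rows that a search of the given depth would apply, lowest PRIORITY first."""
    chosen = [r for r in table if r.priority <= max_priority]
    return sorted(chosen, key=lambda r: (r.priority, r.row))


def class_ids_to_search(table, max_priority):
    """CLASS_ID values of the virtual library records that would be searched."""
    return [r.class_id for r in rows_for_depth(table, max_priority)]


shallow = class_ids_to_search(TABLE, max_priority=1.0)
deep = class_ids_to_search(TABLE, max_priority=2.0)
```

A shallow search (`max_priority=1.0`) applies none of the three rows, while a search to depth 2.0 applies all of them. Matching the SLN pattern against the query and breaking the designated bonds would be done by the chemical software and is not attempted here.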
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The power and utility of topomeric steric field analysis of fragmented structures is 
highlighted by a recent analysis of the structures of Tagamet and Zantac (H2 antagonists). 
Tagamet and Zantac were each fragmented according to Row 7 of the fragmentation table and 
the topomeric steric fields calculated. The metric distance (difference in metric values) for the
two compounds was 127.

Remembering that a range of 80 - 100 defines a neighborhood distance for an
approximate log2 biological difference for the topomeric CoMFA descriptor, the value of 127
strongly suggests that Tagamet and Zantac should have similar biological activities. Such
knowledge would have been very useful to those either seeking to protect molecules with
similar structure/activity to the known molecule or to those seeking to find molecules which
look similar to the receptor but which are not entirely structurally identical to the known
molecule. It should be noted that other widely used diversity approaches, 2D fingerprints and
pharmacophoric patterns, show a remarkable lack of similarity between the drugs. Indeed, in
the topomeric configuration generated by the methods of this invention, Tagamet and Zantac
look very similar even to the unaided eye, as shown in Figure 31.

(d) Topomeric 3D Searching of Core Structures
An ancillary problem when attempting to find molecules in the virtual library
(constructed principally from combinatorial chemistries) which are structurally and biologically
similar to a given query molecule, is the treatment of the central core to which structural
variations can be attached. The virtual library defines the shape similarity of two molecules
as the sum of the similarities of comparable fragments. "Core" fragments are any fragments
that have multiple attachment bonds to other fragments, in contrast to "side chain" fragments
which have only one attachment bond.

Overall molecular shape will be affected most by the relative positions of core
attachment bonds. Consider the three possible bivalent phenyl cores, ortho, meta, and para.
These will be quite similar in their intrinsic shapes (only a hydrogen changes place), but the
molecules derived from the three cores will be very different in shape if the side chains are
at all bulky. Therefore in considering the shape similarities of cores the relative positions of
attachment bonds must be weighted far more heavily than the shape differences themselves.

The prior art has attempted to deal with this problem. Lauri and Bartlett have
described CAVEAT, which in the nomenclature of this disclosure would be considered a "core
similarity" searching system that considers only relative attachment bonds, not shape, of all
theoretically constructible cyclic cores. In their work, the relative geometry of two attachment
bonds is expressed in terms of their distance, angle, and torsions. In contrast the present
inventors have found that a much more self-consistent shape classification of, for example, all
750 commercially offered diamines, is obtained when one of the attachment bonds is aligned
on the X-axis (as in the standard topomer conformation, described earlier) and the differences
calculated as the root mean square of summed differences in the x, y, and z coordinates of the
two ends of the other attachment bond. (The conformation used in this procedure is the
topomeric conformation of the core with a methyl group replacing the more distant attachment
bond.) This procedure differentiates cyclic from acyclic fragments much more strongly than
it differs among the linear acyclic moieties pentylyl, hexylyl, and heptylyl.
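
The attachment bond comparison can be illustrated with a short Python sketch. It assumes that both cores have already been placed in the topomer frame with the first attachment bond on the X-axis, so that only the coordinates of the two ends of the second attachment bond need to be compared. The function name, the normalization over the six coordinate differences, and the coordinates used are assumptions for illustration and are not measured geometries.

```python
import math


def attachment_rms(bond_a, bond_b):
    """RMS difference between two attachment bonds.

    Each bond is ((x, y, z) of the near end, (x, y, z) of the far end),
    expressed in the topomer frame of its own core.
    """
    squared = [
        (p - q) ** 2
        for end_a, end_b in zip(bond_a, bond_b)
        for p, q in zip(end_a, end_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical second attachment bonds of two cores.
core_1 = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
core_2 = ((0.0, 0.0, 0.0), (1.0, 3.0, 0.0))

same = attachment_rms(core_1, core_1)
different = attachment_rms(core_1, core_2)
```

Because the comparison is made on coordinates rather than on a distance and an angle, two cores whose second bonds point in different directions give a large value even when the bonds are the same distance from the first attachment point.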

In addition to this RMS difference in x, y, and z, the differences in steric (and any
other) fields also contribute to the bioisosteric differences between two cores. Because there
are potentially two or more possible attachment bonds in a core, there are two or more ways
in which two or more cores may be compared. So the difference in fields is taken as the least
of these possible differences. The combination of two descriptors in considering the difference
between two core structures, the attachment bond differences and the field differences,
introduces a relative weighting concern. In practice it has been found in clustering experiments
like those described for the thiols that the internally most self-consistent classification of 750
diamines results from numerically equal weighting of the two RMS differences.

Thus the successful generation of a topomeric descriptor for cores involves two 
advances. In comparison with the procedure for side chains, the relative position of attachment 
points has been introduced, for example, to distinguish ortho phenylene from para phenylene. 
In comparison with the treatment of attachment points previously described by Bartlett et al.,
the use of differences in x,y,z coordinates, rather than relative geometries such as distances 
and solid angles, provides a stronger differentiation needed between, for example, cyclic and 
acyclic cores. 

G. Code Attachments

The following software code comprising the main sections of the invention is described
below and is attached in the Appendices. In addition, necessary auxiliary code is also set forth
in the Appendices. All together, all code necessary to fully disclose an enabling embodiment
of the invention in the computational chemistry environment specified earlier is set forth in the
several appendices. In some cases new code is provided which differs from that in the priority
documents to include enhancements described in the text. In particular, as the virtual library
has been expanded, it has been found that the larger number of compounds identified from the
searches is more conveniently handled by code which can deal with bitsets rather than ASCII
text. The additional auxiliary code required to manipulate the bitsets is contained in Appendix R.
However, the use of bitsets is a computational convenience and does not involve any change in
the construction or searching of the virtual library.
Appendix A:

One section of the code in this Appendix generates topomeric conformations, and
another section generates the best slope line for Patterson plots.

Appendix B: This code calculates the hydrogen bond variation to be applied to the
CoMFA steric field.

Appendix E: Petacd.core This code handles the first phase of the construction of the
virtual library.

Appendix F: <:VR MHRN GPLS COMFA HEX CTOPS This code calculates the
topomeric CoMFA field of each structural variation and adds it to the structural variation files.
It also allows the computation and use of other than just steric fields. This Sybyl expression
generator, written in C, is invoked from SPL by a call %comfa_hex(Row Column). It returns
an ASCII hexadecimal representation (0-9,a-f) for each CoMFA grid point in row "Row" and
CoMFA column "Column" in the string which is seen as CTOPS in the input files.

The encoding is as shown in the subroutine lookup_my_comfa_code(). As indicated,
a missing value is assigned "0" and all legitimate values are assigned a number according to
their numerical value. The binning is not quite linear; since the CoMFA values are
infrequently between 10 and 30 this was empirically found to reproduce the exact CoMFA
distances very well. The distances arising from this CTOPS description were validated against
data sets to confirm that the encoding and decoding introduced no significant roundoff
problems. The distances corresponding to the coded topomer field values of CTOPS are seen
in the dbcsln_des routine called WhatsTheDifference().
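
The idea of a non-linear, one-hexadecimal-digit-per-grid-point encoding can be sketched in Python as follows. The bin edges and representative values below are invented for illustration (fine bins below 10, coarse bins between 10 and 30) and are not the values used in lookup_my_comfa_code(); only the conventions that "0" marks a missing value and that the digits 1 through f are assigned in order of increasing field value follow the description above.

```python
import bisect

DIGITS = "0123456789abcdef"

# Hypothetical upper bin edges: 14 edges give 15 bins, coded 1..f.
EDGES = [0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25]
# Hypothetical representative value used when decoding each of the 15 bins.
REPS = [0.25, 0.75, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5, 8.5, 9.5,
        12.5, 17.5, 22.5, 27.5]


def encode(values):
    """Encode field values as one hex digit each; None (missing) becomes '0'."""
    return "".join(
        DIGITS[0] if v is None else DIGITS[bisect.bisect_right(EDGES, v) + 1]
        for v in values
    )


def decode(ctops):
    """Recover approximate field values from a coded string."""
    return [None if c == "0" else REPS[DIGITS.index(c) - 1] for c in ctops]


coded = encode([None, 0.0, 9.5, 12.0, 30.0])
approx = decode(coded)
```

The roundoff introduced by such a scheme is at most half a bin width, which is small where values are common and larger only in the sparsely populated upper range.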
Appendix G: dbcslnprepro

This program takes the description of the common core and solicits for each substituted
position the SLN for the extended core. From this, and the list of structural variations, it
computes the fingerprint and the fingerprint's cardinality for each structural variation and
appends this as the fpcard and fp fields.

Additionally, the program creates a specified fraction of product compounds and
computes their fingerprints exactly. The actual product fingerprint is compared to the
fingerprint estimated from the pieces, and any discrepancy is noted by counting how many
tested products have 0 missing bits, how many have 1, etc. The largest observed value is used
as the MBITS parameter for the reaction. The new version of this code performs the same
functions as the original code except that it writes separate files for fpcard and fp. In addition,
it forms a master file to keep track of the association of all the files.
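
The calibration of MBITS can be sketched in Python as follows. This is an illustration and not the code of Appendix G: fingerprints are represented as sets of bit positions, the estimated product fingerprint is assumed to be the union of the fingerprints of the pieces, and the function names and sample data are hypothetical.

```python
from collections import Counter


def estimate_fingerprint(piece_fps):
    """Assumed estimate: the union (bitwise OR) of the piece fingerprints."""
    estimated = set()
    for fp in piece_fps:
        estimated |= set(fp)
    return estimated


def missing_bits(actual_fp, estimated_fp):
    """Number of bits set in the real product but absent from the estimate."""
    return len(set(actual_fp) - set(estimated_fp))


def calibrate_mbits(samples):
    """samples: iterable of (actual product fingerprint, list of piece fingerprints).

    Returns (MBITS, histogram), where the histogram counts how many tested
    products have 0 missing bits, 1 missing bit, and so on.
    """
    histogram = Counter(
        missing_bits(actual, estimate_fingerprint(pieces))
        for actual, pieces in samples
    )
    return max(histogram), histogram


samples = [
    ({1, 2, 3, 4}, [{1, 2}, {3, 4}]),        # fully reproduced
    ({1, 2, 5, 9}, [{1, 2}, {5}]),           # one bit arises only on joining
    ({3, 4, 5, 10, 11}, [{3, 4}, {5}]),      # two such bits
]
mbits, histogram = calibrate_mbits(samples)
```

The sketch assumes at least one sampled product; the largest number of missing bits seen in the sample then serves as MBITS for the reaction.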
Appendix H: dbcslnsim

This program takes one or more SLN structures as queries, along with the MBITS and
the desired Tanimoto similarity, and the output of the dbcslnprepro run. It produces a listing
of all products which may be above the Tanimoto cutoff value, by listing the index of each
structural variation and both the apparent Tanimoto and the maximum possible Tanimoto (it
is the maximum possible Tanimoto which defines the results). This code now reads master files
and can read bitsets output from other files.
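
The distinction between the apparent and the maximum possible Tanimoto can be illustrated with the following Python sketch. The bound used here is a derivation made for this illustration, not a transcription of Appendix H: it assumes that the estimated product fingerprint may lack up to MBITS bits of the true fingerprint but never contains extra ones, so the most favorable case is that every missing bit coincides with a query bit not yet matched. Those bits are already in the union, so only the count of common bits increases.

```python
def apparent_and_max_tanimoto(query, estimated, mbits):
    """Return (apparent, maximum possible) Tanimoto for an estimated fingerprint.

    query, estimated: sets of 'on' bit positions.
    mbits: largest number of bits the estimate may be missing.
    """
    common = len(query & estimated)
    union = len(query | estimated)
    if union == 0:
        return 1.0, 1.0
    apparent = common / union
    # Best case: missing bits fall on unmatched query bits (union unchanged).
    recoverable = min(mbits, len(query) - common)
    best = (common + recoverable) / union
    return apparent, best


query = frozenset({1, 2, 3, 4, 5})
estimated = frozenset({1, 2, 3, 6})
apparent, best = apparent_and_max_tanimoto(query, estimated, mbits=1)
keep = best >= 0.6  # a product is listed if it *may* exceed the cutoff
```

In this example the apparent value is 0.5 and the maximum possible value is 4/6, so with a cutoff of 0.6 the product is retained even though its apparent similarity falls short.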
Appendix I: dbcslnQS

This program takes the results of the dbcslnprepro program, along with the MBITS and
the Tanimoto similarity neighborhood, to select a designed subset based on Tanimoto similarity
alone. Additional options allow one to remove from consideration products with a parameter
outside of the desired range (such as molecular weight or logP or price), and to remove all
products whose enumerated fields for one or more reagents are not in a list of acceptable
choices (such as supplier).

The design selection consists of first removing products from consideration based on
range of variables or acceptability of reagent. An initial selection is made, normally by random
selection among all remaining products. Every product whose maximum possible Tanimoto
similarity is above the cutoff is removed from further consideration. A product is then selected
from among all remaining products, either randomly or by a rule to continue using one of the
reagents (R1, R2, etc.) so long as possible (so long as any product remains using that reagent).
This selected product's neighbors are removed from further consideration also, and this simple
loop continues until no products remain or a maximum specified number of selections have
been made. The loop is simply: select, remove neighbors in Tanimoto space.
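
The select-and-remove loop can be sketched in Python as follows. This is a minimal illustration, not the code of Appendix I: selection is purely random (the alternative rule of continuing with one reagent as long as possible is omitted), the neighbor test is supplied by the caller, and the function names, fingerprints, and the 0.7 cutoff are assumptions.

```python
import random


def design_subset(products, is_neighbor, max_picks=None, seed=0):
    """Repeat: select a remaining product, then remove it and its neighbors."""
    rng = random.Random(seed)
    remaining = list(products)
    picked = []
    while remaining and (max_picks is None or len(picked) < max_picks):
        choice = rng.choice(remaining)
        picked.append(choice)
        remaining = [
            p for p in remaining if p != choice and not is_neighbor(choice, p)
        ]
    return picked


def tanimoto(a, b):
    union = len(a | b)
    return len(a & b) / union if union else 1.0


# Hypothetical product fingerprints.
library = {
    "a": {1, 2, 3, 4},
    "b": {1, 2, 3, 4, 5},     # neighbor of a (0.8)
    "c": {10, 11, 12},
    "d": {10, 11, 12, 13},    # neighbor of c (0.75)
    "e": {20, 21},
}


def neighbor(x, y, cutoff=0.7):
    return tanimoto(library[x], library[y]) >= cutoff


picked = design_subset(sorted(library), neighbor, seed=1)
```

Whatever the random choices, the result contains no two neighbors and leaves no product without a selected neighbor. Substituting a topomeric distance test, or applying both tests in turn, for `is_neighbor` gives the variants described for the other design programs.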
Appendix J: dbclsn_qstop

This program takes the results of the dbcslnprepro program, along with a value to
define the topomeric similarity neighborhood, to select a designed subset based on topomeric
similarity alone.

This program operates exactly like dbcslnQS, except that the step at which neighbors
are removed is based on topomeric similarity based on the CTOPS fields of the reagents,
rather than the estimate of Tanimoto similarity. Thus after a selection it scans all remaining
products to find every one which has a distance within the similarity radius, and marks these
neighbors as unavailable for further consideration.

(Note that this is equivalent to doing a topomeric similarity search for each selection.
The results are not returned to the user, since their use is to make potential selections
disappear!)

Appendix K: dbcsln_des

This program takes the results of the dbcslnprepro program, along with the MBITS and
the Tanimoto similarity neighborhood, plus a value to define the topomeric similarity
neighborhood, to select a designed subset based on Tanimoto similarity and topomeric similarity
acting independently. This corresponds closely to the method of designed subset selection in
the earlier described method. This code now reads and writes master files and bitsets.

This program operates exactly like dbcslnQS, except that in addition to removing every
Tanimoto neighbor of the selected compound, we also remove the topomeric neighbors. Thus
after a selection it scans all remaining products to find every one which has a distance within
the Tanimoto range, removes them, scans all remaining products to find every one which has
a distance within the topomer range, and removes them.

This is equivalent to doing the dbcslnQS and dbclsn_qstop one after another in the
innermost loop where neighbors are identified and removed. By setting either the Tanimoto
or topomer neighborhood radius to be zero, one should be able to achieve the same results as
dbclsn_qstop or dbcslnQS in fact.

Appendix L: dbcsln_both

This program takes the results of the dbcslnprepro program, along with the MBITS and
a way to scale topomeric distance, plus a similarity cutoff for the combined descriptor of
topomer and Tanimoto, to select a designed subset based on Tanimoto similarity and topomeric
similarity acting as one combined descriptor.

This program operates exactly like dbcslnQS, except that the removal of neighbors is
not based on either Tanimoto or topomeric distance by itself.

This utilizes the new, combined descriptor described earlier. It is not directly equivalent
to either dbcslnQS or dbclsn_qstop in this sense. This code now reads and writes master files
and bitsets.

Appendix M: dbcslntohits

This program takes the index results of dbcslnQS, dbclsn_qstop, dbcsln_both,
dbcsln_des, or dbcslnsim and generates a full product structure SLN hitlist for them. This
hitlist of products is suitable for treatment just as any set of chemical compounds - it loses its
combinatorial identity as it becomes an assembly of independent chemicals. The new version
of this code can now work with bitsets.

Appendix N: CODATA

This is a header file to declare variables.

Appendix O: DB_UTL

This code is a set of subroutines used in many places, and, in particular, by the design
programs.

Appendix P: FLIMATE

This code is a set of subroutines used in many places, and, in particular, by the design
programs.

Appendix Q: FILTER

This code contains subroutines for filtering undesired characteristics from product
molecules.

Appendix R: dbcsln_bitset

These are additional routines needed and called by the other code to handle
bitsets.

Appendix S: topsim

This code performs a topomeric CoMFA search for molecules similar to a query
compound.

Appendix T: tr.ps(>.tiip.core

This code performs the fragmentation required to implement a topomeric search of a
query molecule not necessarily derived from a combinatorial synthesis.

From the preceding description of the construction, generation, and searching of a
virtual library, it should be clear that there are many variations which may be employed and
that, having taught how to generate and search one specific embodiment, all equivalent
embodiments are considered within the scope of this disclosure.

While the preceding written description is provided as an aid in understanding, it should
be understood that the source code listings appended to this application constitute a complete
disclosure of the best mode currently known to the inventors of the methods of constructing

characteristics.

Thus while this invention has been particularly described with reference to the drug
lead identification art, it is clear that the validation of molecular structural descriptors and their
use in selecting structurally diverse sets of chemical compounds can be applied anywhere a
large number of compounds is encountered from which a representative subset is desired. Since
the implications and advances in the art provided by the methods of this invention are still so
new, the entire range of possible uses for the methods of this invention cannot be fully
described at the present time. However, such as yet unidentified uses are considered to fall under
the teachings and claims of this invention if validated molecular structural descriptors are
employed to characterize the diversity of molecules.

APPENDIX "A"



@expression_generator CHOM_THIS_BUILD_3D

# top level routine for generating topomeric conformer
# CHOM!INIT_BUILD_3D must be called beforehand
# returns true unless something went wrong
#
globalvar CHOM!Align
localvar ma msav rid pat tpat p sin noth zs al n capsln \
         polypat patats mpats allpatats
localvar polyats mat1 mat2 schns rbs sybat aneigh ans i \
         mcore jbds tors msln
setvar ma $1
setvar rid $3
setvar capsln $CHOM!Align[ SLN ]
setvar polypat $CHOM!polypat
setvar mcore $CHOM!Align[ MINIT ]
setvar msln $CHOM!Align[ MSLN ]
# fix NO2's (egad what a pain)
setvar pat %search2d( %sln( $ma ) N(=O)O ALL 0 y )
while $pat
    setvar pat %sln_rgroup_sybid( $ma %arg( 1 $pat ) 1 3 )
    modify bond type %bonds( %cat( %arg( 1 $pat ) \
        "=" %arg( 2 $pat ) ) ) 2 >$nulldev
    modify atom type %arg( 2 $pat ) o.2 >$nulldev
    setvar pat %search2d( %sln( $ma ) N(=O)O ALL 0 y )
endwhile
if $CHOM!Align[DEBUG]
    label id *
endif
# basic optimization
switch $2
    case NOBUILD)

    case CONCORD)
        if %not( %chom_concord( $ma ) )
            goto bad_energy
        endif

    case MINIMIZE)
        MAXIMIN $ma DONE INTERACTIVE >$nulldev
        if %gt( $maximin2_energy 1000 )
            goto bad_energy
        endif

endswitch

setvar CHOM!Align[ RBDS ]

# done, if only 3d coord, but for CoMFA ..
if %streql( $4 "A" )

    # detect (pro)chiral atoms for adjustment, adjusting and
    # removing any of pre-defined chirality
    setvar CHOM!Align[CHIRAL] %set_create( %atoms({chiral(*,RS)}) )

    # find a 2D hit
    setvar pat %search2d( %cat( %sln( $ma ) ) $capsln NoDup 0 y )
    if %not( $pat )
        echo $capsln not found in %sln( $ma ) from Row $rid .. skipping
        return
    endif
    setvar pat %arg( 1 $pat )

    # now find the (first) pattern that matches the aligning fragment AND whose
    # atoms are contained by this SLN hit
    setvar allpatats %set_create( %sln_rgroup_sybid( \
        $ma $pat %range( 1 %sln_atom_count( $capsln ) ) ) )
    setvar mpats %search2d( %cat( %sln( $ma ) ) $msln NoDup 0 y )
    for pat in $mpats
        if %not( %set_diff( %set_create( %sln_rgroup_sybid( $ma $pat \
            %range( 1 %sln_atom_count( $capsln ) ) ) ) $allpatats ) )
            break
        endif
    endfor
    setvar polyats %set_create( %sln_rgroup_sybid( $ma $pat $polypat ) )

    # allow user supplied routine to adjust initial conformer
    if $CHOM!Align[ FIX_CF_CALLBACK ]
        $CHOM!Align[ FIX_CF_CALLBACK ] $ma $allpatats
    endif

    # collect all atoms for MATCH and
    # all the info on roots of torsions needing setting
    # ( == all bonds to atoms that are
    # polyvalent within the aligning fragment, except bonds that are (1)
    # in rings or (2) connected to some other atom polyvalent within the
    # aligning fragment).
    setvar mat1
    setvar mat2
    setvar schns
    setvar rbds %set_create( %bonds({rings()}) )
    for a in %range( 1 %sln_atom_count( $msln ) )
        setvar mat1 $mat1 $CHOM!patats[ $a ]
        setvar sybat %sln_rgroup_sybid( $ma $pat $a )
        setvar mat2 $mat2 $sybat
        # build torsion root lists
        if %set_and( $sybat "$polyats" )
            setvar aneigh %set_create( %atom_info( $sybat NEIGHBORS ) )
            setvar ans %set_diff( $aneigh $polyats )
            for i in %set_unpack( "$ans" )
                if %eq( %count( %atom_info( $i NEIGHBORS ) ) 1 )
                    goto notoroot
                endif
                if $rbds
                    if %set_and( $rbds %bonds( %cat( $i "=" $sybat ) ) )
                        goto notoroot
                    endif
                endif
                setvar tors %set_diff( $aneigh $i )
                # if there are multiple possible torsional roots,
                # get one that is part of the root main chain
                if %gt( %set_size( "$tors" ) 1 )
                    if %set_and( "$tors" $polyats )
                        setvar tors %set_and( $tors $polyats )
                    endif
                endif
                # if there are still multiple choices, just have to pick arbitrarily
                if $tors
                    setvar tors %arg( 1 %set_unpack( $tors ) )
                    setvar schns $schns %cat( $sybat $tors $i )
                endif
                notoroot:
            endfor
        endif
    endfor

setvar dofit MATCH %cat( $mcore %set_create( $matl ) ")" ) \ 
%cat( $ma "(" %set_create( $mat2 ) ")" ) 
25 $dofit >$nulldev 

if $CHOM!Align[DEBUG] 

echo %prompt( INT 1 " " " " ) 
endif 
# do FIT 

30 if %gt( $MATCH_RMS $CHOM!Align[ FITRMS ] ) 

setvar CHOM'.BadRows %set_or( "$CHOM!BadRows" $rid ) 

echo Bad geometnc alignment (MATCH_RMS = $MATCH_RMS) \ 

for Row Srid . . skipping 



114 



15 



return 
endif 

# side chain alignments ..
switch $CHOM!Align[ ALICYC ]
case User_Macro)
$CHOM!Align[ ALIDATA ] $ma $CHOM!Align[ MCORE ]

case All_Trans)
case With_Templates)
setvar no_rings TRUE
for i in $schns
setvar jbds %set_unpack( $i )
# can set "side chain" bonds only if connecting bond is not cyclic
if %set_and( "$rbds" "%bonds( %cat( %arg( 3 $jbds ) \
= %arg( 1 $jbds ) ) )" )
setvar no_rings
else
CHOM!AllTrans $jbds
endif
endfor
if $CHOM!Align[DEBUG]
echo %prompt( INT 1 )
endif
if %streql( $CHOM!Align[ ALICYC ] With_Templates )
setvar f %open( $CHOM!Align[ ALIDATA ] "r" )
setvar buff %read( $f )
setvar slnma %cat( %sln( $ma ) )
while $buff
# each line of text should have pattern, SLN IDs for the 4 torsion atoms,
# and a torsion value to set
if %eq( %count( $buff ) 5 )
setvar torpat %search2d( $slnma %arg( 1 $buff ) NoDup 0 y )
for t in $torpat
MODIFY TORSION %sln_rgroup_sybid( $ma $t %arg( 2 $buff ) \
%arg( 3 $buff ) %arg( 4 $buff ) ) %arg( 5 $buff ) >$nulldev
endfor
endif
endwhile
%close( $f )
endif

endswitch
endif

# do a bump check?
if $CHOM!Align[BUMPS]
if %atoms({bumps(*,*)})
echo Bad steric contacts in aligned conformer for \
Row $rid skipping
return
endif
endif
# partial charges
switch $CHOM!Align[ CHARGE ]
case None)

case User_Macro)
exec $CHOM!Align[ CHARGEDATA ] $ma

case )
CHARGE $ma COMPUTE $CHOM!Align[ CHARGE ] | >$nulldev

endswitch
%return( TRUE )
return
bad_energy:
echo Minimization failed - skipping molecule
return
#.

#= = = 



@macro ALLTRANS chom
# assumes default molecule, takes argument atoms $1 and $2
# where $1 is the JOINed atom of the core, $2 is the atom that
# the rest of the substituent is to be trans to,
# and $3 is the JOINed atom of the substituent
# starts from that atom and sets all side chains
# to a trans conformation
# where choices exist, the largest chain is set to trans
# and secondary chains "fall wherever they fall"
# manages chain branchings
# ignores ring bonds
globalvar CHOM!Err CHOM!Align
localvar bds b bdset a1 a2 tmp sbonds sats rbond pbds torsion ringbonds
localvar doit chir cats rgjoined b2set tN-al
if %and( "$batch" "$CHOM!Err" )
RETURN
endif

# warn if angles will be ambiguous
# setvar chir %set_create( %atoms({chiral(*,RS)}) )

# check input for legality
setvar tmp %set_create( %atom_info( $1 NEIGHBORS ) )
if %not( %eq( 2 %count( %set_unpack( %set_and( \
"$tmp" %cat( $2 "," $3 ) ) ) ) ) )
echo Bad input to ALLTRANS (atoms $2 $3 not bonded to $1)
return
endif

# save key bonds 

setvar rbond %bonds( %cat( $3 " = " $1 ) )
setvar sats %conn_atoms( $3 $1 )
if %not( $sats )
# echo No substituent atoms found in ALLTRANS
return
endif

setvar sats $3 $sats
setvar sbonds %set_create( %bonds( %cat( \
"{TO_ATOMS(" %set_create($sats) ")}" )) )

# define the other bonds that might need adjusting
setvar bds %set_create( %bonds( (*-{RINGS()})&<1> ) )
setvar bds %set_and( "$sbonds" "$bds" )
if %not( $bds )
return
endif

# discard bonds to primary atoms
setvar mval %set_create( %atoms( \
<H>+<o.2>+<F>+<I>+<Cl>+<Br>+<n.1>+<LP>+<Du> ) )
setvar pds %set_create( %bonds( %cat( "{TO_ATOMS(" $mval ")}" ) ) )
setvar bds %set_diff( $bds $pds )

setvar CHOM!Align[ RBDS ] %set_or( $bds "$CHOM!Align[ RBDS ]" )
setvar ringbonds %set_create( %bonds({RINGS()}) )

# walk all the important bonds
for b in %set_unpack( $bds )
setvar doit TRUE
# if this is the JOIN bond, already have some info
if %eq( $b $rbond )
setvar a0 $2
setvar a1 $1
setvar a2 $3
# still need to be SURE we're not monovalent
if %or( "%eq( 1 %count( %atom_info( $a1 NEIGHBORS ) ) )" \
"%eq( 1 %count( %atom_info( $a2 NEIGHBORS ) ) )" )
setvar doit
endif
else
setvar bdat %bond_info( $b ORIGIN TARGET )
setvar a1 %arg( 1 $bdat )
setvar a2 %arg( 2 $bdat )
if %or( "%eq( 1 %count( %atom_info( $a1 NEIGHBORS ) ) )" \
"%eq( 1 %count( %atom_info( $a2 NEIGHBORS ) ) )" )
setvar doit
endif
if $doit
# which end leads to root atom? if necessary flip a1,a2 to make that one be a1
if %set_and( "%set_create( %conn_atoms( $a2 $a1 ) )" $1 )
setvar tmp $a1
setvar a1 $a2
setvar a2 $tmp
endif
setvar a0path %trans_path( $a1 $a2 $1 )
setvar a0 %arg( 1 $a0path )
endif
endif

if $doit
setvar a3path %trans_path( $a2 $a1 $CHOM!Align[ ATTACHED ] )
setvar a3 %arg( 1 $a3path )
setvar b2set %bonds( %cat( $a0 " = " $a1 $a2 $a3 ) )
setvar rgjoined %set_and( "$ringbonds" %set_create( $b2set ) )
setvar nrgjoined %count( %set_unpack( "$rgjoined" ) )
setvar b2 %arg( 2 $b2set )
if %eq( 0 $nrgjoined )
setvar torsion 180
else
if %eq( 1 $nrgjoined )
setvar torsion 90
else
setvar torsion 60
endif
endif

modify torsion $a0 $a1 $a2 $a3 $torsion >$nulldev
if %set_and( "$cats" $a2 )
MEASURE TORSION %arg( 2 $a3path ) $a1 $a2 $a3 > $nulldev
setvar torsion $measure_torsion
while %lt( $torsion 0 )
setvar torsion %math( $torsion + 360 )
endwhile
if %gt( 180 $torsion )
CHOM!Reflect $a2 $a1 %arg( 1 $a3path ) \
%arg( 2 $a3path ) %arg( 3 $a3path )
endif
setvar CHOM!Align[ CHIRAL ] %set_diff( "$CHOM!Align[ CHIRAL ]" $a2 )
endif
endif
endfor
#.
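The torsion rule that ALLTRANS applies to each rotatable bond can be restated compactly in C. This is a minimal sketch, not part of the original listing: the function names `target_torsion`, `wrap_torsion`, and `needs_reflection` are hypothetical, and only the 180/90/60 degree choice, the wrap into the 0 to 360 range, and the "reflect when below 180" test are taken from the macro.

```c
#include <assert.h>

/* Hypothetical helper: torsion (degrees) to set about a bond, given how many
   bonds in the four-atom path a0-a1-a2-a3 are ring bonds. */
static double target_torsion(int n_ring_bonds)
{
    assert(n_ring_bonds >= 0);
    if (n_ring_bonds == 0) return 180.0;  /* fully acyclic path: trans */
    if (n_ring_bonds == 1) return 90.0;   /* one ring bond in the path */
    return 60.0;                          /* two or more ring bonds */
}

/* Map a torsion in degrees onto [0, 360), extending the macro's
   "while torsion < 0, add 360" loop to values of 360 or more. */
static double wrap_torsion(double deg)
{
    while (deg < 0.0) deg += 360.0;
    while (deg >= 360.0) deg -= 360.0;
    return deg;
}

/* For a tracked (pro)chiral atom, the macro reflects the substituent
   when the wrapped, measured torsion is below 180 degrees. */
static int needs_reflection(double measured_deg)
{
    return wrap_torsion(measured_deg) < 180.0;
}
```

For example, a measured torsion of -120 degrees wraps to 240 and is left alone, whereas 60 degrees triggers the reflection.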

@macro Reflect CHOM
#= = = = = = = = = = = = = = = = = = = = =

# does a controlled inversion, to convert prochiral atom to topomeric stereoform
localvar arefl
DEFINE PLANE %cat( $1 "," $2 "," $3 ) P1 >$nulldev
setvar arefl $4
setvar arefl %set_or( $arefl "%set_create( %conn_atoms( $4 $1 ) )" )
if $5
setvar arefl %set_or( $arefl $5 )
setvar arefl %set_or( $arefl "%set_create( %conn_atoms( $5 $1 ) )" )
endif

REFLECT $arefl P1 >$nulldev
REMOVE PLANE M* P1 >$nulldev

#.
@expression_generator CHOM_CONCORD
#= = = = = = = = = = = = = = = = = = = = =
# generate a Concord structure for the specified workarea
localvar ma p pat msav noth try
setvar ma $1

# fix indole atom typing problem
setvar pat %search2d( %sln( $ma ) NH(:C):C ALL 0 y )
for p in $pat
setvar tpat %sln_rgroup_sybid( $ma $p 1 )
modify atom only $tpat N.ar 1 >$nulldev
endfor

# renumber heavy atoms to avoid other problems
# echo before renumber: %sln( $ma )
setvar mrenum %molempty()
renumber $ma $mrenum %atoms( *-<H> ) >$nulldev |
copy $mrenum $ma
zap $mrenum
setvar msav %molempty()
copy $ma $msav

setvar nats %mol_info( $ma NATOMS )
DEFAULT $ma > $nulldev
for try in %range( 1 3 )
CONCORD M $ma > $nulldev
# concord can return bond-less structures or some different structure or do nothing
setvar cok TRUE
if %not( %eq( %mol_info( $ma NATOMS ) $nats ) )
setvar cok
endif
if %eq( 0 %mol_info( $ma NBONDS ) )
setvar cok
endif
if $cok
setvar noth %arg( 1 %atoms( <H> ) )
if $noth
measure distance $noth %atom_info( $noth NEIGHBORS ) >$nulldev
setvar cok %gt( $measure_distance 0.9 )
endif
endif
if $cok
break
endif
echo Concord failed try $try
#echo %prompt( INT 2 ' ' )
copy $msav $ma
endfor

if %not( $cok )
if %not( $CHOM!Align[ FAST ] )
echo Concord failed for %sln( $ma ) - minimizing
copy $msav $ma
for try in %range( 1 4 )
MAXIMIN $ma DONE INTERACTIVE
if %lt( $maximin2_energy 1000 )
break
endif
%file_delete(junk.his) >$nulldev
DYNAMICS m1 SETUP junk.his DONE Interval_Length \
300.0 DONE FINISHED INTERACTIVE
if %eq( $try 3 )
zap $msav
return
endif
endfor
else
echo Skipping non-Concord structure
zap $msav
return
endif
endif

zap $msav
if $CHOM!Align[ CORE_SLN ]
# need to find and record other attachment point for trans_path -
# standard aligning group
setvar args $CHOM!Align[ CORE_SLN ]
setvar msln %string_insert( %arg( 1 $args ) \
%arg( 3 $args ) %arg( 2 $args ) )
setvar msln %string_insert( $msln %arg( 4 $args ) R1 )
# can't begin SLN with (
if %eq( 1 %pos( "(CH2" $msln ) )
setvar msln %cat( "CH2(" %substr( $msln 5 ) )
endif

setvar rid %sln_rgroup_slnid( $msln )
setvar hit %search2d( %sln( $ma ) $msln NoDup 1 y )
if %not( $hit )
while %pos( ":" $msln )
setvar msln %string_insert( $msln ":" " ~ " )
endwhile
setvar hit %search2d( %sln( $ma ) $msln NoDup 1 y )
endif

setvar rats %sln_rgroup_sybid( $ma $hit $rid )
if %not( $rats )
echo Pattern $msln not found in %sln( $ma ) - missing core attachment
return
endif

for cat in %set_unpack( $rats )
if %gt( %count( %atom_info( $cat NEIGHBORS ) ) 1 )
break
endif
endfor
setvar CHOM!Align[ ATTACHED ] %set_create( $cat \
%set_diff( %set_create( %atom_info( $cat NEIGHBORS ) ) $rats ) )
endif

%return( TRUE )
#.
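The acceptance test applied after each CONCORD attempt above (atom count unchanged, at least one bond present, and the first hydrogen more than 0.9 from its neighbor) can be expressed as a single predicate. The following C sketch is illustrative only: `struct build_result` and `build_is_acceptable` are hypothetical names, and the distance is assumed to be in Angstroms.

```c
#include <assert.h>

/* Hypothetical summary of a generated 3D structure; field names are illustrative. */
struct build_result {
    int n_atoms;        /* atom count after the build */
    int n_bonds;        /* bond count after the build */
    int has_hydrogen;   /* nonzero if at least one hydrogen is present */
    double h_bond_len;  /* distance from the first H to its neighbor (assumed Angstroms) */
};

/* Returns 1 when the build passes the three checks used by the macro, else 0. */
static int build_is_acceptable(const struct build_result *r, int n_atoms_before)
{
    assert(r != 0);
    if (r->n_atoms != n_atoms_before) return 0;   /* a different structure came back */
    if (r->n_bonds == 0) return 0;                /* a bond-less structure came back */
    if (r->has_hydrogen && !(r->h_bond_len > 0.9)) return 0;  /* collapsed hydrogen */
    return 1;
}
```

A caller would retry the build a few times while this predicate fails and only then fall back to minimization, as the macro does.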

@macro INIT_BUILD_3D CHOM

# prepare and generate global data about template fragment
globalvar CHOM!patats CHOM!polypat
localvar mcore msln capsln patats ys rat yrat nrat tpat a
# setvar mcore $CHOM!Align[ MCORE ]
if $1
setvar mcore $1
else
setvar mcore %molempty()
endif

default $mcore > $nulldev
if $CHOM!Align[DEBUG]
label id *
endif

setvar capsln $CHOM!Align[ SLN ]
# use as is
if $CHOM!Align[ ORIENT ]
# orient template so that an R points in the positive X direction
setvar ys %set_unpack( $CHOM!Align[ ORIENT ] )
setvar rat %arg( 1 $ys )
setvar nrat %arg( 2 $ys )
setvar yrat %arg( 3 $ys )
ORIENT USER $rat $nrat $yrat > $nulldev
endif

# identify all the atoms for FIT,
# Here we identify the SLN IDs of the polyvalent atoms
setvar tpat %arg( 1 %search2d( $capsln $capsln NoDup 0 y ) )
setvar polypat
setvar CHOM!patats
echo %sln_to_mol( $mcore $capsln ) >$nulldev
for a in %range( 1 %sln_atom_count( $capsln ) )
setvar CHOM!patats[ $a ] %sln_rgroup_sybid( $mcore $tpat $a )
if %gt( %count( %atom_info( $CHOM!patats[ $a ] NEIGHBORS ) ) 1 )
setvar polypat $polypat $a
endif
endfor

if $CHOM!Align[DEBUG]
echo %prompt( INT 1 * )
endif

copy $CHOM!Align[ MCORE ] $mcore
zap $CHOM!Align[ MCORE ]
setvar msln %sln( $mcore )
setvar CHOM!polypat $polypat
setvar CHOM!Align[ MINIT ] $mcore
setvar CHOM!Align[ MSLN ] $msln

#.
II-C 

/*#module SYB_MGEN_CONN_ATOMS "V1.0"*/

#include <ctype.h>
#include <string.h>
#include <stdio.h>
#include "ta_config.h"
#include "ta_types.h"
#include "utl_mem.h"
#include "uims2.h"
#include "ta_math.h"
#include "utl_geom.h"
#include "utl_str.h"
#include "molecule.h"
#include "utl_list.h"
#include "syb_uims_def.h"
#include "uims2/macros_proto.h"
#include "syb/expr_p.h"
#include "syb/area_p.h"
#include "syb/atab_p.h"
#include "syb/atom_p.h"
#include "uims2_p.h"
#include "utl_set.h"
/*+E:SYB_MGEN_CONN_BEST*/
/*
 * int SYB_MGEN_CONN_BEST( identifier, nargs, args, writer )
 *
 * Dick Cramer, Apr. 9, 1995 (written for SELECTOR use)
 *
 * Expression generator that returns the atoms attached to a given
 * atom, excepting the second, in a prioritized order.
 *
 * If there are two arguments, the ordering is by decreasing branch
 * "size", where "size" is first any path with rings encountered, then
 * number of attached atoms, then MW (paths in cycles end when an atom
 * in another path is encountered.)
 *
 * If three arguments, the atom that is returned is the one that
 * begins the shortest path containing either of up
 * to two atoms referred to by the
 * third argument. If multiple such paths, ordering is same as for
 * two arguments.
 *
 * If last argument is DEBUG, all paths are written to stdout.
 *
 * User interface:
 * %trans_path( a1 a2 ( a3 ) (DEBUG) )
 */

int SYB_MGEN_CONN_BEST( identifier, nargs, args, Writer )
char *identifier;
int nargs;
char *args[];
PFI Writer;

{
#define MAX_NP 8
    struct pathrec {
        int root, nrings, chosen, nats, done;
        float mw;
        set_ptr path;
        atom_ptr a;
    } ;
    struct pathrec p[MAX_NP];
    int retval, i, np, toroot, a1, a2, a4, a5, a, pnow, pdone, growing,
        final_pos, area_num, new_rings, nats, nuats, elem, ncycles,
        best, debug, ringclosed, p2do;
    List_Ptr atom_exp_list=NIL;
    mol_ptr m1, m2;
    atom_ptr arec1, arec2, arec, a4rec;
    set_ptr atom_set1=NIL, a2chk = NIL, nuls = NIL, cnats = NIL,
        nxcn = NIL, end_atoms = NIL, scratch = NIL;
    char tempString[256];
    float t1, t2, diff, pot1, pot2, podiff;

    retval = 0;

    /* Check the number of arguments */
    if ( nargs < 2 || nargs > 4 ) {
        UIMS2_WRITE_ERROR(
            "Error: %trans_path requires 2 to 4 arguments\n" );
        return 0;
    }

    np = 0;
    debug = (!UTL_STR_CMP_NOCASE( args[ nargs - 1], "DEBUG" ));
    toroot = (debug && nargs == 4) || (!debug && nargs == 3);
    /* PARSE THE INPUT */
    /* get first atom */
    if (!(atom_exp_list = SYB_EXPR_ANALYZE( SYB_EXPR_GET_ATOM_TOKEN,
                                            args[0],
                                            &final_pos, &area_num )))
        goto error;
    if (!(m1 = SYB_AREA_GET_MOLECULE (area_num)))
        goto cleanup;
    if (!(atom_set1 = SYB_ATOM_FIND_SET ( m1, atom_exp_list)))
        goto error;
    if( atom_exp_list)
        SYB_EXPR_DELETE_RPN_LIST( atom_exp_list);
    atom_exp_list = (List_Ptr) NIL;
    if (!(1 == UTL_SET_CARDINALITY(atom_set1))) {
        UIMS2_WRITE_ERROR(
            "Error: First argument must be only one atom\n");
        goto error;
    }
    if (!(arec1 = SYB_ATOM_FIND_REC (m1, UTL_SET_NEXT (atom_set1, -1)) ))
        goto error;
    a1 = arec1->recno;
    UTL_SET_DESTROY( atom_set1 );
    atom_set1 = NIL;
    /* get 2nd atom */
    if (!(atom_exp_list = SYB_EXPR_ANALYZE( SYB_EXPR_GET_ATOM_TOKEN,
                                            args[1],
                                            &final_pos, &area_num )))
        goto error;
    if (!(m2 = SYB_AREA_GET_MOLECULE (area_num)))
        goto cleanup;
    if (!(end_atoms = SYB_ATOM_FIND_SET ( m2, atom_exp_list)))
        goto error;
    if( atom_exp_list)
        SYB_EXPR_DELETE_RPN_LIST( atom_exp_list);
    atom_exp_list = (List_Ptr) NIL;
    if (m1 != m2) {
        UIMS2_WRITE_ERROR(
            "Error: atoms must be in the same molecule\n");
        goto error;
    }
    if (!(1 == UTL_SET_CARDINALITY(end_atoms))) {
        UIMS2_WRITE_ERROR(
            "Error: Second argument must be only one atom\n");
        goto error;
    }
    if (!(arec2 = SYB_ATOM_FIND_REC (m1, UTL_SET_NEXT (end_atoms, -1)) ))
        goto error;
    a2 = arec2->recno;

    /* get 3rd atom */
    if (toroot) {
        if (!(atom_exp_list = SYB_EXPR_ANALYZE( SYB_EXPR_GET_ATOM_TOKEN,
                                                args[2],
                                                &final_pos, &area_num )))
            goto error;
        if (!(m2 = SYB_AREA_GET_MOLECULE (area_num)))
            goto cleanup;
        if (!(atom_set1 = SYB_ATOM_FIND_SET ( m2, atom_exp_list)))
            goto error;
        if( atom_exp_list)
            SYB_EXPR_DELETE_RPN_LIST( atom_exp_list);
        atom_exp_list = (List_Ptr) NIL;
        if (m1 != m2 ) {
            UIMS2_WRITE_ERROR(
                "Error: atoms must be in the same molecule\n");
            goto error;
        }
        if (2 < UTL_SET_CARDINALITY(atom_set1)) {
            UIMS2_WRITE_ERROR(
                "Error: Third argument must be no more than two atoms\n");
            goto error;
        }
        a4 = a5 = -1;
        elem = UTL_SET_NEXT (atom_set1, -1);
        if (!(arec = SYB_ATOM_FIND_REC (m1, elem) )) goto error;
        a4 = arec->recno;
        if ((elem = UTL_SET_NEXT (atom_set1, elem) ) != -1) {
            if (!(arec = SYB_ATOM_FIND_REC (m1, elem) )) goto error;
            a5 = arec->recno;
        }
        UTL_SET_DESTROY( atom_set1 );
        atom_set1 = NIL;
    }

    /* GENERATE the paths */
    /* set up paths */
    if (!(a2chk = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(nuls = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(cnats = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(nxcn = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!(scratch = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
    if (!syb_mgen_conn_att_atoms( a2chk, m1, a1 )) goto error;
    if (!UTL_SET_MEMBER( a2chk, a2 )) {
        UIMS2_WRITE_ERROR (
            "Error: second argument atom is not bonded to first argument atom\n" );
        goto error;
    }
    UTL_SET_DELETE( a2chk, a2 );
    a = -1;
    while (np < MAX_NP && (a = UTL_SET_NEXT( a2chk, a)) >= 0 ) {
        if (!(p[np].path = UTL_SET_CREATE( m1->max_atoms + 1 ) )) goto error;
        p[np].root = a;
        p[np].nrings = p[np].done = 0;
        UTL_SET_INSERT( p[np].path, a );
        if (!(p[np].a = SYB_ATOM_FIND_REC (m1, p[np].root) )) goto error;
        np++;
    }



    /* grow the paths */
    growing = TRUE;
    nats = 0;
    ncycles = 0;
    while (growing ) {
        nuats = 0;
        ringclosed = FALSE;
        for (pnow = 0; pnow < np; pnow++ ) if (!p[pnow].done) {
            UTL_SET_COPY_INPLACE( cnats, p[pnow].path );
            UTL_SET_CLEAR( nxcn );
            elem = -1;
            /* accumulate this generation of attached atoms into nxcn */
            while ( (elem = UTL_SET_NEXT( cnats, elem)) >= 0 ) {
                UTL_SET_CLEAR( nuls );
                if (!syb_mgen_conn_att_atoms( nuls, m1, elem )) return( FALSE );
                UTL_SET_DELETE( nuls, a1 );
                UTL_SET_DIFF_INPLACE( nuls, end_atoms, nuls );
                UTL_SET_OR_INPLACE( nxcn, nuls, nxcn );
                UTL_SET_DIFF_INPLACE( nxcn, p[pnow].path, nxcn );
            }
            UTL_SET_OR_INPLACE( p[pnow].path, nxcn, p[pnow].path );
            if (toroot) if ((UTL_SET_MEMBER( p[pnow].path, a4 ))
                || ( a5 > -1 && UTL_SET_MEMBER( p[pnow].path, a5 ))) p[pnow].done = TRUE;
            /* remove and mark ring closures when growing out */
            if (!toroot) for (pdone = 0; pdone < np; pdone++ ) if (pdone != pnow) {
                UTL_SET_AND_INPLACE( p[pnow].path, p[pdone].path, a2chk );
                if ((new_rings = UTL_SET_CARDINALITY( a2chk ))) {
                    /* we have ring closure(s) */
                    p[pnow].nrings += new_rings;
                    p[pdone].nrings += new_rings;
                    ringclosed = TRUE;
                    UTL_SET_OR_INPLACE( end_atoms, a2chk, end_atoms );
                    /* if pdone < pnow, two branches are now same lengths, drop common atom from both;
                       but if >, branches are different, and must avoid repeated closing */
                    if (pdone < pnow) {
                        /* remove atom(s) in the previous branch because paths are really same length */
                        UTL_SET_DIFF_INPLACE( p[pdone].path, a2chk, p[pdone].path );
                        UTL_SET_DIFF_INPLACE( p[pnow].path, a2chk, p[pnow].path );
                    }
                    else {
                        /* must identify and mark each atom in nxcn that is attached to a2chk atom */
                        elem = -1;
                        while ( (elem = UTL_SET_NEXT( a2chk, elem)) >= 0 ) {
                            UTL_SET_CLEAR( scratch );
                            if (!syb_mgen_conn_att_atoms( scratch, m1, elem ))
                                return( FALSE );
                            UTL_SET_AND_INPLACE( scratch, nxcn, scratch );
                            UTL_SET_OR_INPLACE( end_atoms, scratch, end_atoms );
                        }
                    }
                }
            }
        }
        /* done growing paths if no more atoms added to any path .. */
        for (pdone = 0, nuats = 0; pdone < np; pdone++ )
            nuats += UTL_SET_CARDINALITY( p[pdone].path );
        if (nuats <= nats && !ringclosed) growing = FALSE;
        nats = nuats;
        /* .. or after 100 atom layers out regardless */
        ncycles++;
        if (ncycles >= 100) growing = FALSE;
    }

    /* debugging */
    if (debug) for (pdone = 0; pdone < np; pdone++) {
        sprintf( tempString, "Path %d (%d rings, from %d): ",
                 pdone+1, p[pdone].nrings, p[pdone].root );
        UBS_OUTPUT_MESSAGE( stdout, tempString );
        ashow( p[pdone].path, m1 );
    }

    /* compute the path properties */
    for (pdone = 0; pdone < np; pdone++) {
        p[pdone].chosen = toroot && (UTL_SET_MEMBER( p[pdone].path, a4 )
            || ( a5 > -1 && UTL_SET_MEMBER( p[pdone].path, a5 )));
        p[pdone].nats = UTL_SET_CARDINALITY( p[pdone].path );
        p[pdone].nrings = p[pdone].nrings ? 1 : 0;
        p[pdone].mw = 0.0;
        p[pdone].done = 0;
    }

    /* return all root atoms, ordered best to worst */
    for (p2do = 0; p2do < np; p2do++ ) {
        for (pdone = 0; pdone < np; pdone++ ) if (!p[pdone].done) {
            best = pdone;
            break;
        }
        for (pdone = 0; pdone < np; pdone++ ) if (!p[pdone].done && pdone != best) {
            if (!p[best].chosen && p[pdone].chosen) best = pdone;
            if (p[best].chosen == p[pdone].chosen) {
                if (p[pdone].nrings && !p[best].nrings) best = pdone;
                else if ((!p[best].chosen && (p[pdone].nats > p[best].nats)) ||
                         (p[best].chosen && (p[pdone].nats < p[best].nats))) best = pdone;
                else if (p[pdone].nats == p[best].nats) {
                    p[pdone].mw = get_path_mw( p[pdone].path, m1, p[pdone].mw );
                    p[best].mw = get_path_mw( p[best].path, m1, p[best].mw );
                    if (p[pdone].mw > p[best].mw) best = pdone;
                    else if (p[pdone].mw == p[best].mw) {
                        /* checking relative geometries of attachments via "improper" torsion */
                        /* the phenyl ether problem - if candidates are 180 degrees apart and we are on the */
                        if (toroot) {
                            /* are we 180 apart? */
                            if (!( a4rec = SYB_ATOM_FIND_REC (m1, a4 )) ) goto error;
                            pot1 = UTL_GEOM_TAU ( a4rec->xyz, arec1->xyz, arec2->xyz,
                                                  p[best].a->xyz );
                            pot2 = UTL_GEOM_TAU ( a4rec->xyz, arec1->xyz, arec2->xyz,
                                                  p[pdone].a->xyz );
                            podiff = pot1 - pot2;
                            while (podiff < 0.0) podiff += 360.0;
                            while (pot2 < 0.0) pot2 += 360.0;
                            if (podiff < 190.0 && podiff > 170.0 && pot2 < 180.0)
                                best = pdone;
                        }
                        if (best != pdone) {
                            /* if not already set, according to the previous special case, then */
                            /* if torsions differ by 360 degrees then we have trans, prefer the +180 */
                            t1 = UTL_GEOM_TAU ( p[pdone].a->xyz, arec1->xyz, arec2->xyz,
                                                p[best].a->xyz );
                            t2 = UTL_GEOM_TAU ( p[best].a->xyz, arec1->xyz, arec2->xyz,
                                                p[pdone].a->xyz );
                            diff = t1 - t2;
                            if (diff > 355.0) best = pdone;
                            else if (diff > -355.0) {
                                while (t1 < 0.0) t1 += 360.0;
                                if (t1 > 170.0 && t1 <= 350.0) best = pdone;
                            }
                        }
                    }
                }
            }
        }
        arec = SYB_ATOM_FIND_REC( m1, p[best].root );
        sprintf(tempString, "%d", arec->id );
        if(!(*Writer)(tempString)) goto error;
        p[best].done = TRUE;
    }

    retval = TRUE;
error:
cleanup:
    if( atom_exp_list)
        SYB_EXPR_DELETE_RPN_LIST( atom_exp_list);
    if(atom_set1)
        UTL_SET_DESTROY(atom_set1);
    if(end_atoms)
        UTL_SET_DESTROY(end_atoms);
    if(a2chk)
        UTL_SET_DESTROY(a2chk);
    if(nuls)
        UTL_SET_DESTROY(nuls);
    if(nxcn)
        UTL_SET_DESTROY(nxcn);
    if(cnats)
        UTL_SET_DESTROY(cnats);
    if(scratch)
        UTL_SET_DESTROY(scratch);
    return( retval );
}
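The priority ordering used above to pick the "best" branch can be isolated into one comparison function. The C sketch below is a simplified restatement, not the original code: `struct path_summary` and `candidate_wins` are hypothetical names, and the final geometric tie-breaks (the improper-torsion checks) are deliberately left out.

```c
#include <assert.h>

/* Hypothetical per-branch summary, mirroring the fields of struct pathrec
   that take part in the ranking. */
struct path_summary {
    int chosen;    /* branch contains one of the requested target atoms */
    int has_ring;  /* a ring closure was met while growing the branch */
    int n_atoms;   /* number of atoms in the branch */
    float mw;      /* summed atomic weight of the branch */
};

/* Returns nonzero when cand should replace best. Order of tests follows the
   listing: target membership, ring presence, atom count (larger wins when
   growing outward, smaller wins when heading for a target), then weight. */
static int candidate_wins(const struct path_summary *best,
                          const struct path_summary *cand)
{
    assert(best != 0 && cand != 0);
    if (!best->chosen && cand->chosen) return 1;
    if (best->chosen != cand->chosen) return 0;
    if (cand->has_ring && !best->has_ring) return 1;
    if ((!best->chosen && cand->n_atoms > best->n_atoms) ||
        (best->chosen && cand->n_atoms < best->n_atoms)) return 1;
    if (cand->n_atoms == best->n_atoms) return cand->mw > best->mw;
    return 0;
}
```

Repeatedly selecting the winner among the branches not yet emitted yields the best-to-worst list that `%trans_path` returns.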

static int syb_mgen_conn_att_atoms( aset, m, atid )
/* ors atoms attached to atid into aset */
/* WORKS STRICTLY WITH RECNOS */
set_ptr aset;
mol_ptr m;
int atid;
{
    atom_ptr at;
    List_Ptr tohs;
    atom_ptr toh;
    acon_ptr conn1;
    unsigned nbytes1;

    at = SYB_ATOM_FIND_REC( m, atid );
    tohs = at->conn_atom;
    while (tohs) {
        tohs = UTL_LIST_RETRIEVE_P( tohs, &conn1, &nbytes1 );
        toh = SYB_ATOM_FIND_REC( m, conn1->target );
        UTL_SET_INSERT( aset, toh->recno );
    }
    return( TRUE );
}

static float get_path_mw( aset, m, mw )
/* returns the total atomic weight of all atoms in aset */
set_ptr aset;
mol_ptr m;
float mw;
{
    int elem = -1;
    float ans = 0.0;
    atom_ptr at;

    if (mw) return( mw );
    elem = -1;
    while ( (elem = UTL_SET_NEXT( aset, elem)) >= 0 ) {
        at = SYB_ATOM_FIND_REC( m, elem );
        ans += (float) SYB_ATAB_ATOMIC_WEIGHT( at->type );
    }

    return( ans );
}

static void ashow( aset, m )
/* for interactive debugging, shows a set's membership in terms of atom ID */
set_ptr aset;
mol_ptr m;
{
    char buff[1000], *b;
    atom_ptr at;
    int elem;

    *buff = '\0';
    b = buff;
    elem = -1;
    while ( (elem = UTL_SET_NEXT( aset, elem)) >= 0 ) {
        at = SYB_ATOM_FIND_REC( m, elem );
        sprintf(b, " %d", at->id);
        b = buff + strlen( buff );
    }
    sprintf( b, "\n" );
    UBS_OUTPUT_MESSAGE( stdout, buff);
}
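The layer-by-layer growth of each branch in SYB_MGEN_CONN_BEST can be illustrated on a plain adjacency matrix. This C sketch is a simplification under stated assumptions: `branch_size` and `MAX_ATOMS` are hypothetical, the set library is replaced by small arrays, and the ring-closure bookkeeping between competing branches is omitted.

```c
#include <assert.h>

#define MAX_ATOMS 16

/* Count the atoms reachable from `root` without passing through `blocked`
   (the atom the branch hangs from), adding one bond layer per pass.
   adj is a symmetric 0/1 adjacency matrix over atoms 0..n-1. */
static int branch_size(int n, int adj[MAX_ATOMS][MAX_ATOMS], int root, int blocked)
{
    int in_path[MAX_ATOMS] = {0};
    int size = 1, added = 1, i, j;

    assert(n > 0 && n <= MAX_ATOMS);
    in_path[root] = 1;
    while (added) {
        int next[MAX_ATOMS] = {0};  /* atoms first reached in this layer */
        added = 0;
        for (i = 0; i < n; i++) {
            if (!in_path[i]) continue;
            for (j = 0; j < n; j++)
                if (adj[i][j] && j != blocked && !in_path[j]) next[j] = 1;
        }
        for (j = 0; j < n; j++)
            if (next[j]) { in_path[j] = 1; size++; added = 1; }
    }
    return size;
}
```

Growth stops when a pass adds no atoms, which corresponds to the "no more atoms added to any path" exit test in the original loop.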
/* BEGINNING OF SUBROUTINES I-D. Calculation of attenuated fields */
/*+E:QSAR_FIELD_EVAL_RB_ATTEN()*/
/**********************************************************************/
/*                                                                    */
/* int QSAR_FIELD_EVAL_RB_ATTEN( molp, stfldp, elfldp, regp, no_st, no_el, ctp ) */
/*                                                                    */
/* Dick Cramer May 13, 1995                                           */
/*                                                                    */
/* "Standard CoMFA" - except that the contribution of any atom        */
/* to the field falls off with an inverse power of its distance       */
/* from a root atom, measured in NUMBER OF ROTATABLE BONDS!           */
/* This means also that each individual atom's contribution           */
/* has a similarly scaled upper bound, rather than checking           */
/* the upper bound only for the sum over all atoms.                   */
/*                                                                    */
/* This procedure computes vdW 6-12 steric values at each point in region */
/* and the electrostatic interactions (initially assuming 1/r dielectric). */
/*                                                                    */
/* NOTE:: initially ignoring space averaging, other user knobs        */
/* note:: assuming valid input here; error checking higher up !       */
/*                                                                    */
/* Input:                                                             */
/* molp - molecule pointer, molecule to place in region.              */
/* stfldp - steric field pointer, where values will be placed.        */
/* elfldp - electrostatic field pointer, where values will be placed. */
/* regp - region pointer, locations where values are to be evaluated. */
/* no_st - flag to skip steric evaluations                            */
/* no_el - flag to skip electrostatic evaluations                     */
/* ctp - ComfaTopPtr, for dummy/lp values                             */
/*                                                                    */
/* Returns 0 on failure, 1 otherwise.                                 */
/*                                                                    */
/**********************************************************************/
/*+E:QSAR_FIELD_EVAL_RB_ATTEN()*/
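The attenuation described in the header comment can be sketched for a single atom and a single grid point. This is an illustrative C fragment under stated assumptions: the weight form w = factor^n_rot is an assumption (the listing takes its weights from QSAR_FIELD_RB_WTS(), which is not shown here), and `rb_weight` and `atom_steric` are hypothetical names. Only the order of operations, 6-12 term first, then the per-atom cap, then the weight, follows the routine.

```c
#include <assert.h>

/* Assumed weighting: w = factor^n_rot, where n_rot is the number of rotatable
   bonds between the atom and the root atom. Illustrative form only. */
static double rb_weight(double factor, int n_rot)
{
    double w = 1.0;
    assert(n_rot >= 0);
    while (n_rot-- > 0) w *= factor;
    return w;
}

/* One atom's 6-12 steric term at squared distance dis2 from the grid point,
   capped at max_value and then scaled by the atom's weight wt. */
static double atom_steric(double vdw_a, double vdw_b, double dis2,
                          double max_value, double wt)
{
    double dis6 = dis2 * dis2 * dis2;
    double dis12 = dis6 * dis6;
    double e = vdw_b / dis12 - vdw_a / dis6;   /* repulsive minus attractive */
    if (e > max_value) e = max_value;          /* per-atom upper bound */
    return e * wt;                             /* rotatable-bond attenuation */
}
```

Because the cap is applied before the weight, an atom several rotatable bonds from the root can never contribute more than its weight times the field maximum, which is the "similarly scaled upper bound" the comment refers to.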



int QSAR_FIELD_EVAL_RB_ATTEN ( molp, stfldp, elfldp, regp, no_st, no_el, ctp )
mol_ptr molp;
FieldPtr stfldp, elfldp;
RegionPtr regp;
int no_st, no_el ;
ComfaTopPtr ctp;
{
    BoxPtr box;
    atom_ptr at, SYB_ATOM_FIND_ID();
    int pid, b, ix, iy, iz, nat, vol_avg, repulsive ;
    fpt *steric, *elect, SYB_ATAB_VDW_RADII() ;
    fpt diff, dis, dis2, x, y, z, sum_steric, sum_elect ;
    fpt dis6, dis12, repuls_val, offs[9][3], atm_ste, atm_ele;
    fpt *charge, *ctemp, *coord, *ftemp, *wt, scale_vol_avg, atm_steric, atm_elect;
    int *atyp, *itemp, dohbd, dohba, ishbd, retval, dielectric, off, atid;
    static fpt hbond_scal;
    fpt hbond_A, hbond_B, *AtWts = NIL, *QSAR_FIELD_RB_WTS();
    int *HAs, *HDs, *HAp, *HDp; /* sets would be more efficient, but slower */
    int do_steric, do_elect;
    set_ptr hdonor, SYB_HBOND_DONORS(), pset = NIL, aset = NIL;

#define Q2KC 332.0
#define MIN_SQ_DISTANCE 1.0e-4
    /* any atom within 10^-2 Angstroms is hereby zapped !
       this is about it: 10^6 / 10^-24 is close to overflow! */

    ftemp = NIL; ctemp = NIL; itemp = NIL; retval = FALSE; HAs = NIL; HDs = NIL;
    hdonor = NIL;

    /* for now, make root atom the one closest to 0,0,0 */
    for (nat = 1; nat <= molp->natoms; nat++) {
        at = SYB_ATOM_FIND_ID( molp, nat );
        dis2 = at->xyz[0] * at->xyz[0] + at->xyz[1] * at->xyz[1] +
               at->xyz[2] * at->xyz[2];
        if (nat == 1 || dis2 < dis) {
            dis = dis2;
            atid = nat;
        }
    }

    /* following is specific to topomeric fields */
    if (!(AtWts = QSAR_FIELD_RB_WTS( molp, atid ) )) goto cleanup;
    if (!no_el)
    {   dielectric = elfldp->dielectric ;
        vol_avg = elfldp->vol_avg_type;
        scale_vol_avg = elfldp->scale_vol_avg;
        repulsive = elfldp->repulsive;
        repuls_val = repexp[repulsive]; elect = elfldp->field_value;}
    if (!no_st)
    {   vol_avg = stfldp->vol_avg_type;
        scale_vol_avg = stfldp->scale_vol_avg;
        repulsive = stfldp->repulsive;
        repuls_val = repexp[repulsive]; steric = stfldp->field_value;}

    if (!(ftemp = (fpt *) UTL_MEM_ALLOC( 3*sizeof(fpt)*molp->natoms ))) goto cleanup;
    if (!(ctemp = (fpt *) UTL_MEM_ALLOC( sizeof(fpt)*molp->natoms ))) goto cleanup;
    if (!(itemp = (int *) UTL_MEM_ALLOC( sizeof(int)*molp->natoms ))) goto cleanup;
    if (!(HAs = (int *) UTL_MEM_ALLOC( sizeof(int)*molp->natoms ))) goto cleanup;
    if (!(HDs = (int *) UTL_MEM_ALLOC( sizeof(int)*molp->natoms ))) goto cleanup;

    /* get just those H's which are capable of Hbonding */
    if (!(hdonor = SYB_HBOND_DONORS( molp, NIL ) )) goto cleanup;

    for (coord=ftemp, atyp=itemp, charge=ctemp, HAp=HAs, HDp=HDs, nat=1;
         nat <= molp->natoms; nat++)
    {   if (NIL == (at = SYB_ATOM_FIND_ID( molp, nat ) ) ) goto cleanup;
        *coord++ = at->xyz[0];
        *coord++ = at->xyz[1];
        *coord++ = at->xyz[2];
        *atyp++ = at->type - 1;
        *charge++ = at->charge;
        *HAp++ = SYB_ATAB_HBOND_ACCEPT( at->type );
        *HDp++ = UTL_SET_MEMBER( hdonor, at->recno );

    }
    for (b=0; b<regp->n_boxes; b++) {
        box = & regp->box_array[b];
        dohbd = (SYB_ATAB_ATOMIC_NUMBER( box->atom_type ) == 1) &&
                (box->pt_charge == 1.0);
        dohba = (SYB_ATAB_ATOMIC_NUMBER( box->atom_type ) == 8) &&
                (box->pt_charge == -1.0);
        if (dohbd || dohba) {
            if (!TAILOR_STORE_IT_HERE(
                    "TAILOR!FORCE_FIELD!HBOND_RAD_SCALING",
                    &hbond_scal, 1)) goto cleanup;
            hbond_A = pow( hbond_scal, 6.0 );
            hbond_B = hbond_A * hbond_A;
        }
        if (vol_avg)
            FIELD_EVAL_GETOFF( offs, box->stepsize, vol_avg, scale_vol_avg );
        if ( !no_st )
            QSAR_FIELD_VDWTAB ( box->atom_type, repuls_val, ctp->du_lp_steric );
        for (iz=0, z=box->lo[2] ; iz < box->nstep[2]; iz++, z += box->stepsize[2])
        for (iy=0, y=box->lo[1] ; iy < box->nstep[1]; iy++, y += box->stepsize[1])
        for (ix=0, x=box->lo[0] ; ix < box->nstep[0]; ix++, x += box->stepsize[0])
        {
            for ( coord = ftemp, charge = ctemp, atyp = itemp, HAp=HAs, HDp=HDs,
                  do_steric=TRUE, do_elect=TRUE, nat=0, sum_steric = sum_elect = 0.0;
                  nat < molp->natoms;
                  nat++, wt++)
            {

                if ( ( *atyp == DUMMY-1 || *atyp == LP-1 ) && !ctp->du_lp_elect )
                    *charge = 0.0; /* set charge to 0 since ignoring Du/lp */
                if (!vol_avg) /* the "normal" case */
                {
                    dis2 = x - *coord++ ;
                    dis2 *= dis2;
                    diff = y - *coord++ ;
                    diff *= diff;
                    dis2 += diff;
                    diff = z - *coord++ ;
                    diff *= diff;
                    dis2 += diff;
                    if ( !no_el && elfldp->zap_el==2 && do_elect) {
                        dis = sqrt( dis2 );
                        if ( dis < SYB_ATAB_VDW_RADII( *atyp+1 ) ) {
                            /* no shortcircuits! */
                            /*
                            *elect++ = 0.0;
                            do_elect = FALSE;
                            */
                        }
                    }



            if ( dis2 < MIN_SQ_DISTANCE ) {
                if ( !no_st )
                    /* if atom has no steric value, we don't care about
                       MIN_SQ_DISTANCE since it has no contribution anyway */
                    if ( vdw_a[*atyp] != 0.0 && vdw_b[*atyp] != 0.0 ) {
                        /* set sterics to its max value at current grid pt. */
                        atm_steric = (*wt) * stfldp->max_value;
                    }
                if ( !no_el && do_elect ) {
                    if ( !no_st && !do_steric && elfldp->zap_el ) {
                        *elect++ = DAB_F_MISSING;
                    }
                    else if ( *charge != 0.0 ) {
                        if ( *charge > 0.0 )
                            atm_elect = (*wt) * elfldp->max_value;
                        else atm_elect = (*wt) * -elfldp->max_value;
                    }
                }
                if ( !do_elect && !do_steric )
                    break;   /* break out of loop since neither el. or st.
                                need to be calculated for this grid point */

                /* setting dis2 to 1 (an arbitrary no.) will prevent a zero
                   divide in the sum_steric or sum_elect calculations below */
                dis2 = 1.0;
            }



            if ( !no_st && do_steric ) {
                dis6 = dis2 * dis2 * dis2;
                dis12 = dis6 * dis6;
                if (repulsive)
                    dis12 = (repulsive == 1) ? dis12 / dis2 : dis12 / dis2 / dis2;
                if (dohbd && *HAp)
                    atm_steric = hbond_B * vdw_b[*atyp]/dis12 -
                                 hbond_A * vdw_a[*atyp]/dis6;
                else if (dohba && *HDp)
                    atm_steric = hbond_B * vdw_b[*atyp]/dis12 -
                                 hbond_A * vdw_a[*atyp]/dis6;
                else
                    atm_steric = vdw_b[*atyp]/dis12 - vdw_a[*atyp]/dis6;
                HAp++; HDp++; }

            atm_steric = atm_steric > stfldp->max_value ? stfldp->max_value
                                                        : atm_steric;
            atm_steric *= (*wt);

            if ( !no_el && do_elect ) {
                atm_elect = *charge++ /
                            ( dielectric ? sqrt(dis2) : dis2 );
                atm_elect = atm_elect > elfldp->max_value ? elfldp->max_value
                                                          : atm_elect;
                atm_elect = atm_elect < -(elfldp->max_value) ? -(elfldp->max_value)
                                                             : atm_elect;
                atm_elect *= (*wt);
                sum_elect += atm_elect;
            }

            atyp++;
            sum_steric += atm_steric;
        }

        else
        {   for (off=0; off<9; off++)
            {
            }
            coord += 3;
            atyp++;
            charge++;
            HAp++;
            HDp++;
        }
      } /* atom loop */



doneatoms:
      if ( do_steric || do_elect ) {
          if (vol_avg) { sum_elect /= 9.0; sum_steric /= 9.0; }
          if ( !no_el && do_elect )
          {   *elect = sum_elect * box->pt_charge * Q2KC;
              if ( *elect > elfldp->max_value ) *elect = elfldp->max_value;
              else if ( *elect < -elfldp->max_value ) *elect =
                  -elfldp->max_value;
              transform_field(elfldp->max_value, elect, ctp);
              elect++;
          }
          if ( !no_st && do_steric )
          {   *steric = sum_steric;
              if ( *steric > stfldp->max_value )
              {   *steric = stfldp->max_value;
                  if ( !no_el && elfldp->zap_el == 1 ) *(elect-1) = DAB_F_MISSING; }
              transform_field(stfldp->max_value, steric, ctp);
              steric++; }
      }
    } /* points in box loop */
    } /* boxes loop */
    retval = TRUE;
cleanup:
    if (itemp) UTL_MEM_FREE( itemp );
    if (ftemp) UTL_MEM_FREE( ftemp );
    if (ctemp) UTL_MEM_FREE( ctemp );
    if (HAs) UTL_MEM_FREE( HAs );
    if (HDs) UTL_MEM_FREE( HDs );
    if (hdonor) UTL_SET_DESTROY( hdonor );
    if (AtWts) UTL_MEM_FREE( AtWts );
    if (pset) UTL_MEM_FREE( pset );
    if (aset) UTL_MEM_FREE( aset );
    return retval;
#undef Q2KC
#undef MIN_SQ_DISTANCE
}



/* 

static fpt *QSAR_FIELD_RB_WTS( molp, rootid )
/* generates rotational-bond wts for each atom */
mol_ptr molp;
int rootid;
{
/* pseudo code for FIELD_RB_WTS()

   while saw new atoms
      uncover atoms that stopped last shell growth
      grow next "rotational shell"
      while adding to shell
         for each atom in shell
            get neighbors not seen
            for each neighbor
               if bond is rotatable (acyclic, > 1 attached atom, not =,am,#)
                  cover all other atoms attached to atom for this shell
               add it to shell
*/
    fpt *ansr = NIL, *vals = NIL, factor, nowfact = 1.0;
    int found, aggcount, atid, aggid, loop, size;
    set_ptr aggats = NIL, allats = NIL, nuls = NIL, endatms = NIL, end_cands = NIL;
    atom_ptr root, SYB_ATOM_FIND_REC(), at, atrec;
    bond_ptr b, SYB_BOND_FIND_REC();
    List_Ptr toats, UTL_LIST_RETRIEVE_P();
    acon_ptr cptr;
    char tempString[200];
    void ashow(), qsar_field_attached_atoms();

    if (!(vals = (fpt *) UTL_MEM_ALLOC( sizeof(fpt)*molp->natoms ))) return( NIL );
    if (!UIMS2_VAR_GET_TOKEN( "TAILOR!COMFA!AGGREG_DESCALE",
                              &factor )) return( NIL );
    if (!(allats = UTL_SET_CREATE( molp->max_atoms + 1 ))) goto cleanup;
    if (!(aggats = UTL_SET_CREATE( molp->max_atoms + 1 ))) goto cleanup;
    if (!(nuls = UTL_SET_CREATE( molp->max_atoms + 1 ))) goto cleanup;
    if (!(endatms = UTL_SET_CREATE( molp->max_atoms + 1 ))) goto cleanup;
    if (!(end_cands = UTL_SET_CREATE( molp->max_atoms + 1 ))) goto cleanup;
    if (!(root = SYB_ATOM_FIND_REC( molp, rootid ))) goto cleanup;

    UTL_SET_INSERT( aggats, root->recno );
    UTL_SET_INSERT( allats, root->recno );
    aggcount = loop = 1;
    while (TRUE) {
        while (TRUE) {
            aggid = -1;
            while ((aggid = UTL_SET_NEXT( allats, aggid )) >= 0 ) {
                UTL_SET_CLEAR( nuls );
                qsar_field_attached_atoms( nuls, molp, aggid );
                UTL_SET_DIFF_INPLACE( nuls, allats, nuls );
                UTL_SET_DIFF_INPLACE( nuls, endatms, nuls );
                /* identifying any atoms that terminate this aggregate */
                atid = -1;
                while ((atid = UTL_SET_NEXT( nuls, atid )) >= 0 ) {
                    if (!(at = SYB_ATOM_FIND_REC( molp, atid ))) goto cleanup;

                    /* skipping monovalent atoms */
                    if (at->nbond > 1) {
                        /* find bond record that attaches to aggid */
                        toats = at->conn_atom;
                        found = FALSE;
                        while (toats && !found) {
                            toats = UTL_LIST_RETRIEVE_P( toats, &cptr, &size );
                            found = (cptr->target == aggid);
                        }
                        if (!found) goto cleanup;
                        b = SYB_BOND_FIND_REC( molp, cptr->bond_rec );
                        if ( !(b->status & BOND_V_IRING) && !(b->status & BOND_V_ERING)
                             && (b->type == SYB_BTAB_MNEM_TO_TYPE("1")) ) {
                            /* have an end-of-aggregate atom, mark as end atoms all other attached atoms */
                            UTL_SET_CLEAR( end_cands );
                            qsar_field_attached_atoms( end_cands, molp, at->recno );
                            UTL_SET_DELETE( end_cands, aggid );
                            UTL_SET_OR_INPLACE( endatms, end_cands, endatms );
                        }
                    }
                }
                UTL_SET_OR_INPLACE( aggats, nuls, aggats );
            }
            if (UTL_SET_CARDINALITY( aggats ) <= aggcount ) break;
            aggcount = UTL_SET_CARDINALITY( aggats );
            UTL_SET_OR_INPLACE( allats, aggats, allats );

} 

        /* debugging stuff .. */
        /*
        sprintf( tempString, "Aggregate %d (weight = %f ):", loop, nowfact );
        UBS_OUTPUT_MESSAGE( stdout, tempString );
        ashow( aggats, molp );
        */

        /* if no atoms added, we are done! */
        if (UTL_SET_EMPTY( aggats )) break;

        /* record scaling factor for atoms in this aggregate */
        atid = -1;
        while ((atid = UTL_SET_NEXT( aggats, atid )) >= 0 ) {
            if (!(atrec = SYB_ATOM_FIND_REC( molp, atid ))) goto cleanup;
            vals[ (atrec->id)-1 ] = nowfact;
        }
        UTL_SET_OR_INPLACE( allats, aggats, allats );
        UTL_SET_CLEAR( aggats );
        UTL_SET_CLEAR( endatms );
        aggcount = 0;
        nowfact *= factor;
        loop++;
    }

    ansr = vals;

cleanup:

    if (aggats) UTL_SET_DESTROY( aggats );
    if (allats) UTL_SET_DESTROY( allats );
    if (endatms) UTL_SET_DESTROY( endatms );
    if (end_cands) UTL_SET_DESTROY( end_cands );
    if (nuls) UTL_SET_DESTROY( nuls );
    return( ansr );
}
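
The weighting applied by QSAR_FIELD_RB_WTS can be illustrated in isolation. The sketch below is not SYBYL code: it assumes a hypothetical array `shell_of[]` that already holds each atom's rotational-shell index (0 for the aggregate containing the root atom) and reproduces only the weight update, in which the weight starts at 1.0 and is multiplied by the descale factor once per shell.

```c
#include <stddef.h>

/* Hypothetical helper, not part of SYBYL. Given the rotational-shell index
   of every atom (0 = aggregate containing the root atom), assign the weight
   factor^shell, mirroring the "nowfact *= factor" update made after each
   aggregate is recorded. */
static void shell_weights(const int *shell_of, size_t natoms,
                          double factor, double *weights)
{
    size_t i;
    int s;

    for (i = 0; i < natoms; i++) {
        double w = 1.0;                 /* nowfact starts at 1.0 */
        for (s = 0; s < shell_of[i]; s++)
            w *= factor;                /* one descale step per shell */
        weights[i] = w;
    }
}
```

With a factor below 1.0, atoms separated from the root by more rotatable bonds contribute progressively less to the computed fields.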

static void qsar_field_attached_atoms( aset, m, atid )
/* ors atoms attached to atm into aset */
/* WORKS STRICTLY WITH RECNOS */
set_ptr aset;
mol_ptr m;
int atid;
{
    atom_ptr at, SYB_ATOM_FIND_ID();
    List_Ptr tohs, UTL_LIST_RETRIEVE_P();
    atom_ptr toh, SYB_ATOM_FIND_REC();
    acon_ptr conn1;
    int nbytes1;

    at = SYB_ATOM_FIND_REC( m, atid );
    tohs = at->conn_atom;
    while (tohs) {
        tohs = UTL_LIST_RETRIEVE_P( tohs, &conn1, &nbytes1 );
        toh = SYB_ATOM_FIND_REC( m, conn1->target );
        UTL_SET_INSERT( aset, toh->recno );
    }
    return;
}





static void ashow( aset, m )
/* for interactive debugging, shows a set's membership in terms of atom ID */
set_ptr aset;
mol_ptr m;
{
    char buff[1000], *b;
    atom_ptr at, SYB_ATOM_FIND_REC();
    int elem;

    *buff = '\0';
    b = buff;
    elem = -1;
    while ( (elem = UTL_SET_NEXT( aset, elem )) >= 0 ) {
        at = SYB_ATOM_FIND_REC( m, elem );
        sprintf( b, " %d", at->id );
        b = buff + strlen( buff );
    }
    sprintf( b, "\n" );

    UBS_OUTPUT_MESSAGE( stdout, buff );
}

# Section II-A. SPL invoked shell for computing the diagonal defining the
# "best" triangle, e.g., the one with the highest density of points below.

@expression_generator LRT_FAST
# Usage:
# lrt_fast rows descriptor_cols bio_col [pls flags like scaling in quotes]
#   rows (*) - rows to take
#   descriptor_cols - which columns are the neighborhood metrics
#   bio_col - which column has the bio (probably log bio) data
#   [...] - if need to SCAL NONE or anything like that, do it here
#
# returns a line of the form
# 3.09691 / 0.000546509 = 5666.71 - 496 : 496 :: 15.6981 : 15.6989
# ^ max bio difference
#           ^ optimal distance division for max bio
#                         ^ slope
#                                   ^ number in the lrt
#                                         ^ total number
#                                                ^ area in the lrt
#                                                          ^ total area
#
# Significance is related to whether ratio of numbers is
# much above ratio of areas.
#
globalvar SAMPLS_IN_PROGRESS DONE_CHECKED_OUT
localvar hold distname rows cols bio
setvar rows %promptif("$1" ROW_EXP "Rows to use in lrt")
setvar cols %promptif("$2" COL_EXP "COMFA*" "Columns of mol descriptors")
setvar bio %promptif("$3" COL_EXP "LOGBIO" "Column of bio data")
setvar hold SAMPLS_IN_PROGRESS
setvar SAMPLS_IN_PROGRESS $bio
setvar distname TAILOR!HIER!DIST_FNAME
setvar TAILOR!HIER!DIST_FNAME lrt_fort.3

# here the information is computed and written to a file
# whose name is passed in via a TAILOR value
QSAR ANA DO I >$NULLDEV $rows $cols HIER $4 1

setvar SAMPLS_IN_PROGRESS $hold
setvar TAILOR!HIER!DIST_FNAME $distname

# contents of the file are returned to the caller
setvar hold %system("cat lrt_fort.3")
%return( "$hold" )






# Section II-B. SPL script for computing the significance of the distribution
# found by lrt_fast

@expression_generator dochi
# computes the chi-square statistic for the number of points below
# the diagonal, null hypothesis being the area fraction of the total.
# To be called as: %dochi( %lrt_fast( ) ), i.e., its inputs
# are exactly the output of %lrt_fast as described in the lrt_fast header.

setvar expected %math( $9 * $11 / $13 )
setvar sq %math( $7 - $expected )
setvar sq %math( $sq * $sq / $expected )
%return( $sq )
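
For readers more familiar with C than SPL, the arithmetic of the dochi script can be sketched as follows. The function name and parameter names are illustrative assumptions; the mapping to positional tokens follows the lrt_fast output line, where $7 is the number of points in the lrt, $9 the total number of points, $11 the area of the lrt, and $13 the total area.

```c
/* Hypothetical C equivalent of the %dochi arithmetic; not part of the
   original listing. Under the null hypothesis the expected count below the
   diagonal is the total count scaled by the area fraction of the triangle. */
static double lrt_chi_square(double n_in_lrt, double n_total,
                             double area_in_lrt, double area_total)
{
    double expected = n_total * area_in_lrt / area_total;   /* $9 * $11 / $13 */
    double diff = n_in_lrt - expected;                      /* $7 - expected  */

    return diff * diff / expected;
}
```

A large value indicates that many more points fall below the diagonal than the area fraction alone would predict.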
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/* Section II-C. Computes the best diagonal in the "virtual graph" of biological 
distances vs property differences. */ 

int QSHELL_HIER_LRT(table,biocol,dmat,nrow,order,lmsg)
char *table;
int biocol,      /* column in MSS with biological data */
    nrow,        /* dimension of dmat and order */
    *order;      /* array of row IDs to consider */
fpt *dmat;       /* distance matrix for property distances */
char *lmsg;      /* file name for results */
{
    fpt *p, *q, fabs(), bmax;
    int i, j, count, status_array;
    char *fpt_colname;
    FILE *out, *UTL_FILE_FOPEN();

    /* need to get the bio values
       In the n^2 we can repack into n(n-1)/2 then add the n bio values
       and finish with the bio distances */
    /*
       No error handling. Better be data in those rows!
    */
    for (count=0, i=0; i<nrow; i++)
        for (j=0; j<i; j++)
            dmat[count++] = dmat[i*nrow + j];

    q = p = dmat + ( (nrow-1) * nrow ) / 2;
    TBL_ACCESS_INDEX_TO_COLNAME(table, biocol-1, &fpt_colname);
    TBL_GRAB_INIT_FPTS(table, 1, &fpt_colname);
    for (i=0; i<nrow; i++, p++)
        TBL_GRAB_GET_FPTS_INV(order[i]-1, &status_array, p);
    TBL_GRAB_COMPLETE_FPTS();

    bmax = 0.0;
    for (count=0, i=0; i<nrow; i++)
        for (j=0; j<i; j++, count++)
            if ( (p[count] = fabs(q[i] - q[j])) > bmax ) bmax = p[count];
    out = UTL_FILE_FOPEN(lmsg,"w");
    QSHELL_HIER_DO_LRT(out, count, dmat, p, bmax);
    UTL_FILE_FCLOSE(out);
}
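
The in-place repacking of the n x n distance matrix into its n(n-1)/2 strict lower-triangle entries, performed at the top of QSHELL_HIER_LRT, can be shown on its own. The sketch below uses `double` in place of the `fpt` type and a hypothetical function name; it relies on the fact that the write index never overtakes the read index, so no scratch buffer is needed.

```c
/* Hypothetical stand-alone version of the repacking loop. Copies the strict
   lower triangle of a row-major n x n matrix to the front of the same array
   and returns the number of packed entries, n(n-1)/2. The copy is safe in
   place because count <= i*n + j for every (i, j) visited. */
static int pack_lower_triangle(double *dmat, int n)
{
    int i, j, count = 0;

    for (i = 0; i < n; i++)
        for (j = 0; j < i; j++)
            dmat[count++] = dmat[i * n + j];
    return count;
}
```

The space freed behind the packed entries is what the original routine reuses for the biological values and their pairwise differences.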

int QSHELL_HIER_DO_LRT( out, index, xsort, ysort, bmax )
FILE *out;
fpt *xsort, *ysort, bmax;
int index;
{
    int *order, count, j, i, bad;
    int bestN, bestI;
    fpt den, bestDen;

#define CUTOFF ( bmax * ( xsort[order[i]] / xsort[order[j]] ) )
    if (!(order = (int *) UTL_MEM_ALLOC( index * sizeof(int) ))) return 0;

    for (i=0; i<index; i++) order[i] = i;
    bestN = bestI = bad = 0;
    bestDen = 0.0;
    fpt_heapsort(index, xsort, order);
    for (j=0; count=0, bad=0, j < index; j++)
    {
        if (xsort[order[j]] <= 0.0) continue;
        for (i=0; i<=j; i++)
        {
            if (ysort[order[i]] <= CUTOFF) count++;
            else bad++;
        }   /* loop over all d <= this distance */
        if ( (den = count / bmax / xsort[order[j]] * 2.0) > bestDen )
            {bestDen = den; bestI = j; bestN = index - bad;}
    }   /* loop over all distances */

    den = bmax * xsort[order[index-1]];
    sprintf(msg, "%g / %g = %g - %d : %d :: %g : %g\n",
            bmax, xsort[order[bestI]], bmax/xsort[order[bestI]],
            bestN, index, den - xsort[order[bestI]]*bmax/2.0, den);
    UBS_OUTPUT_MESSAGE(out, msg);
    UTL_MEM_FREE(order);
    return 1;
}



/* n is number of elements
   arrin is array of floats to be sorted
   indx is array of ints initially 0...n-1
*/
int fpt_heapsort(n,arrin,indx)
int n;
fpt *arrin;
int *indx;
{
    int l, ir, indxt, i, j;
    fpt q;

    l = n/2;
    ir = n - 1;
    while (TRUE)   /* the "10" loop */
    {
        if (l>0) { indxt = indx[--l]; q = arrin[indxt]; }
        else
        {
            indxt = indx[ir]; q = arrin[indxt];
            indx[ir--] = indx[0];
            if ( ir == 0 )
                { indx[0] = indxt; return 1; }   /* <=== Only way out */
        }
        i = l;
        j = l + l + 1;
        while (j <= ir)   /* the "20" loop */
        {
            if ( (j<ir) && (arrin[indx[j]] < arrin[indx[j+1]]) ) j++;
            if (q < arrin[indx[j]]) { indx[i] = indx[j]; i = j; j = j+j+1; }
            else { j = ir + 1; }
        }
        indx[i] = indxt;
    }
}

/* SECTION III-A. Declarations for all non-standard data structures referenced
   in the C code functions shown in Sections I and II. */

/* Molecule and Supporting Structure Definitions */

/* John McAlister 09-Aug-1985 */

/* This file contains the definitions for the molecular data struc- */
/* tures required within SYBYL. The contents of this file are      */
/* described in detail in the document "SYBYL Molecular Data Struc- */
/* tures".                                                         */

/* Define the molecule descriptor template */
typedef struct molecule_struct {
    char     *name;        /* pointer to molecule name */
    i32      type;         /* molecule type */
    List_Ptr dict;         /* list of dictionaries used with molecule */
    i32      status;       /* molecule status */
    char     *comment;     /* pointer to comment for molecule */
    stamp    cre_time;     /* creation time/user/version stamp */
    stamp    mod_time;     /* modification time/user/version stamp */
    int      max_props;    /* maximum properties currently allocated */
    int      nprops;       /* number of molecular properties */
    List_Ptr props;        /* pointer to list of properties */
    int      max_feats;    /* maximum features currently allocated */
    int      nfeats;       /* number of molecular features */
    List_Ptr feats;        /* pointer to list of molecular features */
    int      max_subst;    /* maximum substructures currently allocated */
    int      nsubst;       /* number of substructures in molecule */
    List_Ptr subst;        /* pointer to list of substructures */
    List_Ptr subst_roots;  /* pointer to list of root subst offsets */
    int      max_atoms;    /* maximum atoms currently allocated */
    int      natoms;       /* number of atoms in molecule */
    List_Ptr atoms;        /* pointer to atom array segment list */
    int      max_bonds;    /* maximum bonds currently allocated */
    int      nbonds;       /* number of bonds in molecule */
    List_Ptr bonds;        /* pointer to bond array segment list */
    int      charges;      /* type of atomic charges, if present */
    fpt      vector[3];    /* translation vector for molecule */
    fpt      matrix[9];    /* rotation matrix for molecule */
    List_Ptr assoc_data;   /* pointer to list of associated data */
                           /* descriptors */
} molecule, *mol_ptr;

/************************* ATOM DEFINITION *************************/

/* Define the atom entry record template */
typedef struct atom_struct {
    char     *name;      /* atom name */
    int      type;       /* atom type */
    i32      status;     /* atom status */
    int      recno;      /* cumulative atom record number */
    int      id;         /* atom id (logical atom number) */
    int      link;       /* link to next atom record */
    int      subst;      /* offset to substructure containing atom */
    List_Ptr property;   /* pointer to list of properties for atom */
    List_Ptr feature;    /* pointer to list of features including */
                         /* this atom */
    int      nbond;      /* number of bonds involving this atom */
    List_Ptr conn_atom;  /* pointer to list of bonded atoms */
    fpt      xyz[3];     /* coordinates of atom */
    fpt      charge;     /* point charge on atom */
} atom, *atom_ptr;

/* Define the atom array segment descriptor template */
typedef struct atom_seg_struct {
    atom_ptr seg_head;   /* pointer to head of atom array segment */
    mol_ptr  molecule;   /* pointer to molecule containing atom seg */
    int      max_atom;   /* maximum number of atom records in seg */
    int      natom;      /* number of filled atom records in seg */
    int      used_atom;  /* offset to first filled record in segment */
    int      free_atom;  /* offset to first free record in segment */
} atom_seg, *aseg_ptr;

/* Define the bond specifier records pointed to by the atom records */
typedef struct atom_conn_struct {
    int target;          /* offset to target atom */
    int bond_rec;        /* offset to bond descriptor record */
} atom_conn, *acon_ptr;

/************************* BOND DEFINITION *************************/

/* Define the bond entry record template */
typedef struct bond_struct {
    int      type;       /* bond type */
    i32      status;     /* bond status */
    int      recno;      /* cumulative bond record number */
    int      id;         /* bond id (logical bond number) */
    int      link;       /* link to empty bond record */
    List_Ptr property;   /* pointer to bond property list */
    List_Ptr feature;    /* pointer to list of features including */
                         /* this bond */
    int      o_subst;    /* offset to origin atom substructure */
    int      origin;     /* offset to atom at bond origin */
    int      t_subst;    /* offset to target atom substructure */
    int      target;     /* offset to atom at bond destination */
} bond, *bond_ptr;

/* Define the bond array segment descriptor template */
typedef struct bond_seg_struct {
    bond_ptr seg_head;   /* pointer to head of bond array segment */
    mol_ptr  molecule;   /* pointer to molecule containing bond seg */
    int      max_bond;   /* maximum number of bonds in segment */
    int      nbond;      /* number of filled bond records in seg */
    int      used_bond;  /* offset to first filled record in segment */
    int      free_bond;  /* offset to first free record in segment */
} bond_seg, *bseg_ptr;

/* ===== comfa.h ===== */

/* Regions are the set of points at which energy evaluations are made */
/* in the CoMFA method of QSAR. A region is defined as the union      */
/* of a set of 3D boxes (which may be a single point in the           */
/* limit) and their associated attributes. Attributes needed for      */
/* CoMFA purposes are outlined below.                                 */

#ifndef QSAR_COMFA_DEFINITIONS
#define QSAR_COMFA_DEFINITIONS 1
#include "u_types.h"
#define DUMMY 26   /* dummy atom id */
#define LP    20   /* lone pair atom id */

typedef enum {
    FDENGY_UNKNOWN,
    FDENGY_ELECT,
    FDENGY_STERIC,
    FDENGY_HOMO,
    FDENGY_LUMO,
    DOCK_ELECT,
    DOCK_STA_NOHB,
    DOCK_STA_HBD,
    DOCK_STA_HBA,
    DOCK_STB_NOHB,
    DOCK_STB_HBD,
    DOCK_STB_HBA } FldEngyTyp;

typedef enum {
    FDHD_ORIGINAL,
    FDHD_FFIT,
    FDHD_XTERN,
    FDHD_FUNC,
    FDHD_USER,
    FDHD_USR_AVG,
    FDHD_DOCK,
    FDHD_AVG,
    FDHD_SIG,
    FDHD_MAX,
    FDHD_MIN,
    FDHD_COEFF,
    FDHD_AVG_X,
    FDHD_SIG_X,
    FDHD_FLD_X,
    FDHD_RANGE,
    FDHD_PLS_XWT,
    FDHD_PLS_XLOAD,
    FDHD_FAC_LOAD,
    FDHD_FAC_COMM,
    FDHD_FAC_ROTLOAD,
    FDHD_SIMCA_LOAD,
    FDHD_SIMCA_MODEL,
    FDHD_SIMCA_DISCRIM,
    FDHD_HBD } FldHowTyp;

typedef struct {
    fpt lo[3],         /* corner with lowest values for each axis */
        hi[3],         /*   "     "   hi-est   "     "    "    "  */
        stepsize[3];   /* increment between points */
    int nstep[3],      /* derived as 1 + (hi-lo + epsilon) / stepsize */
        n;             /* n = product of nstep[i] */
    int atom_type;     /* SYBYL atom type, for steric energy computation */
    fpt pt_charge;     /* elemental charge at point, for electrostatics */
    fpt *weight;       /* weight[i] is applied in all computations, e.g. = 1 */
    int avg_type;      /* box of 'scale', sphere, sphere x vdw, ...? */
    fpt avg_scale;     /* scale whose meaning derived from avg_type */
    int arb,           /* arbitrary int for later use */
        *parb;         /*     "     pointer "    "    "  */
} Box, *BoxPtr;



typedef struct {
    char   *filename;     /* name of the region's file (if any) */
    int    n_boxes;       /* number of boxes which make up the region */
    int    n_points;      /* number of points in this region altogether */
    BoxPtr box_array;     /* box_array[n_regions], each one a Box */
    int    n_refs;        /* number of CURRENT references to this memory */
    long   when_made;     /* creation stamp */
} Region, *RegionPtr;

typedef struct {
    char       *reg_name;      /* name of the region's file (if any) */
    char       *fld_name;      /* name of this field's file (if any) */
    RegionPtr  reference;      /* the region referenced by this field */
    FldEngyTyp fid;            /* what type of field is referenced here */
    int        num_avgd;       /* number of fields averaged into this one */
    int        curr_iter;      /* number of iterations in current field fit run */
    char       *mol_id;        /* unspecified molecule id,
                                  e.g. dbname/molname/alignname */
    int        n_points;       /* number of points in associated region */
    int        zap_el;         /* whether electrostatics are MISSING when>max_st */
    fpt        max_value;      /* largest permitted absolute value of energy */
    fpt        *field_value;   /* values at each point of the field */
    int        n_refs;         /* number of CURRENT references to this memory */
    long       when_made;      /* creation stamp */
    /* added these 4 items 1/30/89 DEP */
    int        vol_avg_type;
    fpt        scale_vol_avg;
    int        dielectric;
    int        repulsive;
    FldHowTyp  how_made;       /* perry's way = 1 or old way = 0 */
} Field, *FieldPtr;



/* molecule dependent information solicited by QSAR table operations,
   passed into COMFA column field evaluations */
typedef struct {
    boolean  already_field;  /* whether a field name exists (otherwise alignment) */
    char     *some_name;     /* name of alignment; Nil align == use as is (!) */
    char     *steric_name;   /* name of steric field (if applicable) */
    char     *elect_name;    /* name of electrostatic field (if applicable) */
    FieldPtr sfld_p;         /* points to steric field in memory (when there) */
    FieldPtr efld_p;         /* points to elect. field in memory (when there) */
} ComfaMol, *ComfaMolPtr;

/* molecule-independent information for CoMFA evaluations */
typedef struct {
    int       vol_avg;       /* case for volume averaging: 0,1,2=none,box,sphere (0) */
    fpt       vol_scale;     /* scale for volume averaging (1.0) */
    int       fld_types;     /* case for what fields: 0,1,2=both,steric,elect. (0) */
    fpt       steric_max;    /* maximum steric energy (30) */
    int       repulsive;     /* steric repulsive exponent - 12,10,or 8 (12) */
    fpt       elect_max;     /* maximum electrostatic energy (30) */
    int       dielectric;    /* case for dielectric (AS FORCE FIELD TAILOR) */
    int       elect_out;     /* case to drop elect inside steric max: 0,1 =T,F (1) */
    char      *region_name;  /* name of region used in the CoMFA computations */
    FieldPtr  sweight_fld;   /* points to MEMORY field for weighting steric PLS */
    FieldPtr  eweight_fld;   /* points to MEMORY field for weighting elect. PLS */
    FldHowTyp how_done;      /* perry's way = 1 or old way = 0 */
    int       du_lp_steric;  /* include dummies and lone pairs in steric field
                                calculations */
    int       du_lp_elect;   /* include dummies and lone pairs in electrostatic
                                field calculations */
    int       spare1;        /* As of 6.1 comfa, this is TAILOR!COMFA!TRANSFORM */
    int       spare2;        /* INDICATOR SCALE among other things */
} ComfaTop, *ComfaTopPtr;



#endif
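
The way the `repulsive` and `max_value` settings enter the per-atom steric term can be shown with a small stand-alone function. This is a sketch under stated assumptions, not the SYBYL routine: `vdw_a` and `vdw_b` are passed in as plain numbers instead of being looked up per atom type, the hydrogen-bond scaling and atom weights are omitted, and the function name is hypothetical.

```c
/* Hypothetical stand-alone version of the per-atom steric term.
   dis2      - squared distance between probe point and atom
   vdw_a/b   - attractive and repulsive coefficients (assumed given)
   repulsive - 0 keeps the r^-12 wall, 1 softens it to r^-10,
               any other nonzero value to r^-8
   max_value - largest permitted energy; larger results are truncated */
static double steric_term(double dis2, double vdw_a, double vdw_b,
                          int repulsive, double max_value)
{
    double dis6 = dis2 * dis2 * dis2;
    double dis12 = dis6 * dis6;
    double e;

    if (repulsive)
        dis12 = (repulsive == 1) ? dis12 / dis2 : dis12 / dis2 / dis2;
    e = vdw_b / dis12 - vdw_a / dis6;
    return e > max_value ? max_value : e;
}
```

Truncating at `max_value` keeps grid points that lie inside an atom from dominating the field with very large repulsive energies.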
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Section III-B. Functional descriptions of external procedures. (Routines that simply return 
dynamic memory to the heap are not described.) 

BOND_V_ERING - TRUE if bond is in an external ring.

BOND_V_IRING - TRUE if bond is in an internal (simple) ring.

QSAR_FIELD_EVAL_GETOFF - provides coordinates for field computation when
"volume averaging" is being done.

QSAR_FIELD_VDWTAB - returns steric parameters for the computation of the field
contribution from the probe atom and each of the molecule atoms.

SYB_AREA_GET_MOLECULE - returns the internal representation of the molecule in
some area or "container", if such exists.

SYB_ATAB_ATOMIC_NUMBER - returns the atomic number of the specified atom type.

SYB_ATAB_ATOMIC_WEIGHT - returns the atomic weight of the specified atom type.

SYB_ATAB_HBOND_ACCEPT - returns TRUE if the specified atomic type is a
hydrogen-bond accepting atom.

SYB_ATAB_VDW_RADII - returns the atomic radius of the specified atomic type.

SYB_ATOM_FIND_ID - returns the internal representation of an atom referenced by its
atom ID number. (Atom IDs are guaranteed to be continuous but the ID of any single atom
may change as atoms are added or deleted.)

SYB_ATOM_FIND_REC - returns the internal representation of an atom referenced by its
record ID number. (Atom record IDs are invariant but there may be "holes" in their
sequence such that the largest record ID may be greater than the number of atoms.)

SYB_ATOM_FIND_SET - returns the bitset of atoms corresponding to a list of atoms.

SYB_BOND_FIND_REC - returns the internal representation of a bond referenced by its
(invariant) record ID number.

SYB_BTAB_MNEM_TO_TYPE - converts an ASCII representation of a bond type to its
internal representation.

SYB_EXPR_ANALYZE - parses a user-entered ASCII description of atoms (e.g.,
M2(<H>) for all hydrogen atoms within molecule M2) into internally valid
representations of molecule and atoms.

SYB_HBOND_DONORS - returns the set of IDs for atoms which are hydrogen-bonding
hydrogens.

TAILOR_STORE_IT_HERE - returns the current value of a user- (and SPL-) accessible
variable.

TBL_ACCESS_INDEX_TO_COLNAME - converts a user-provided MSS column ID to a
column name (name is guaranteed to be a unique identifier).

TBL_GRAB_COMPLETE_FPTS - done returning multiple (scalar) values in an MSS
column to an array.

TBL_GRAB_GET_FPTS_INV - in a multiple value retrieval, returns the value
corresponding to a user-provided row ID.

TBL_GRAB_INIT_FPTS - set up for returning multiple (scalar) values in an MSS column
to an array.

UBS_OUTPUT_MESSAGE - equivalent to fprintf().

UIMS2_VAR_GET_TOKEN - returns the current value of a global SPL variable.

UIMS2_WRITE_ERROR - writes text to the error output stream.

UTL_FILE_FCLOSE, UTL_FILE_FOPEN - equivalent to fclose() and fopen().

UTL_LIST_RETRIEVE - returns the next element on a linked list.

UTL_MEM_ALLOC - equivalent to malloc().

UTL_SET_AND_INPLACE - makes the first set logically equivalent to the second set,
with only those bits that are also 1 in the third set becoming 1 in the first set.

UTL_SET_CARDINALITY - returns the number of bits that are 1 in a particular bitset.

UTL_SET_CLEAR - sets all bits in the set to 0.

UTL_SET_COPY_INPLACE - makes the first set logically identical to the second.

UTL_SET_CREATE - creates and returns an empty set of requested size.

UTL_SET_DELETE - sets the specified bit to 0.

UTL_SET_DIFF_INPLACE - makes the first set logically equivalent to the second set,
with all bits that are 1 in the third set becoming 0 in the first set.

UTL_SET_EMPTY - TRUE if all bits in the set are 0.

UTL_SET_INSERT - sets the requested bit to 1.

UTL_SET_MEMBER - returns TRUE if the requested set bit equals 1.

UTL_SET_NEXT - returns the identity of the next non-zero bit in a set.

UTL_SET_OR_INPLACE - makes the first set logically equivalent to the second set, with
all bits that are 1 in the third set becoming 1 in the first set.

UTL_STR_CMP_NOCASE - non-case sensitive version of strcmp().
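
The set routines described above behave like operations on a fixed-size bitset. The following minimal sketch shows one way such a bitset could be written; it is an assumption-laden illustration, not the SYBYL implementation, and every name in it (`bitset`, `bitset_create`, and so on) is hypothetical. The three-argument operations follow the descriptions given here: the first argument receives the result computed from the second and third.

```c
#include <stdlib.h>
#include <limits.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    size_t nbits;            /* capacity of the set in bits */
    unsigned long *words;    /* packed bit storage */
} bitset;

/* Always allocate at least one word so an empty set is still valid. */
static size_t nwords(size_t nbits) { return nbits / WORD_BITS + 1; }

static bitset *bitset_create(size_t nbits)      /* cf. UTL_SET_CREATE */
{
    bitset *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->nbits = nbits;
    s->words = calloc(nwords(nbits), sizeof(unsigned long));
    if (!s->words) { free(s); return NULL; }
    return s;
}

static void bitset_destroy(bitset *s)
{
    if (s) { free(s->words); free(s); }
}

static void bitset_insert(bitset *s, size_t bit) /* cf. UTL_SET_INSERT */
{
    s->words[bit / WORD_BITS] |= 1UL << (bit % WORD_BITS);
}

static int bitset_member(const bitset *s, size_t bit) /* cf. UTL_SET_MEMBER */
{
    return (int)((s->words[bit / WORD_BITS] >> (bit % WORD_BITS)) & 1UL);
}

/* dst = a with every bit of b added; dst may alias a or b. */
static void bitset_or_inplace(bitset *dst, const bitset *a, const bitset *b)
{
    size_t i, n = nwords(dst->nbits);
    for (i = 0; i < n; i++) dst->words[i] = a->words[i] | b->words[i];
}

/* dst = a with every bit of b removed; dst may alias a or b. */
static void bitset_diff_inplace(bitset *dst, const bitset *a, const bitset *b)
{
    size_t i, n = nwords(dst->nbits);
    for (i = 0; i < n; i++) dst->words[i] = a->words[i] & ~b->words[i];
}

/* Next set bit after prev, or -1 if none; pass -1 to start (cf. UTL_SET_NEXT). */
static long bitset_next(const bitset *s, long prev)
{
    size_t bit;
    for (bit = (size_t)(prev + 1); bit < s->nbits; bit++)
        if (bitset_member(s, bit)) return (long)bit;
    return -1;
}
```

The `bitset_next` convention of starting from -1 and stopping on a negative return matches the iteration idiom used throughout the listings, `while ((id = UTL_SET_NEXT( set, id )) >= 0)`.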
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APPENDIX "B" 

^ PHORE LOG column type and 
/* CODE. This code implements a PHORE_i. 

calculates a single molecule) 
cell value (the Hydrogen Bonding Fxngerpr.nt for 

5 within the SYBYL understood that other 

Molecular Spreadsheet. ^ ^.^^ ^.^^ 

supporting code handles user input, user 

'rdall structure for PHORE_LOC column type */ 
10 typedef 

struct PHORE { feature file - 

^^'^r.^ ^rx' /* user name for Dibcu reau^i- 
char *disco_rn, / ^^^^ 

default 

appears below */ ^^^^ ^^^^^ f,,t„re file 

j5 int disco__in, / 

loaded */ Hpfininq region file 

char *region_fn; /* user name for defining 

*^ . ran. /* internal reference to region when 

RegionPtr rgn, / 

20 loaded */^^ ^^^^^^ lattice points (each 

direction) 

tor each PHORE £eatu.e */ ^^^^^ 
int nbits , / 
25 contents or EVAL 
fails) */ 

} PHORE, *PPHORE; 

" 0S.R.P.OC_BV.._PHO.K_U>C,.aMe„a.e. rov,, coXna.e, 

*/ 

/* 

*^ .1 jul-95 (PHORE LOG == lattice bitset 

35 /* Dick Cramer 31-Jul v 

) */ 
/* 
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*/ 



This .oaule generates bitsets whos. carainality is e,ual « 

*/ V. 

/* lattice points x 2 (# of sitepoint classes. For each 

' inllance of a pharmacophoric point in the .olecule heing 

p/olessed, the geoxnetrically nearest (1..)- bits in the 

10 /* billet will be set to 1 (where . is user supplied) . 
*/ 



/* 
/* 



/* 



/* 

15 a */ 
/* 



note: this routine explicitly requires that sets begin after 
first element that is the set sizel ! 1 



*/ 



/ 



*/ 

20 /* Inputs 



*/ 



/ 



* 



*/ 

/ * outputs 
25 */ 



/* 
/* 



*/ 

user Required Definition Files 
*/ 



30 /* 

********** / 

int QSAR_PROC_EVAL_PHORE_LOC(tablename, row, colname)
char *tablename, *colname;
int row;
{
    mol_ptr mol;
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PPHORE phr ; 

^j^t err, status, nvalid, mol_area; 

char 

set_ptr print, qsar_proc_calc_phore_set ( ) , 

5 FILE *fp; 

/* get the molecule */ ^at qf &mol^ ) 

if ( iTBL_UTL_GET_MOLECULE(tablename, row, FALSE, &mol) ) 

{ {err=l; 
if ( UTL_ERROR_IS_SET() ) 

10 goto 

error; } 

else return FALSE; 

/* git the user-provided input data */ ^olname 
\t ( lTBL_ATTR_FIND_COLUMN_A(tablename, colname, 

..PK0C_SUPP0RT", Sdu., ^^^^^^^ ^ ^^^^^3^. 

goto 

error ; } . . , 

20 /* retrieve DISCO stuff if not yet present */ 
if ( 1 phr->disco_in) { 
if (* .phr->disco_fn) {err=l; goto error;} 

25 ) ; 

UIMS2_EXEC_COMMAND( str ); 
UIMS2_EXEC_COMMAND( "DISCO INIT" ); 
phr->disco_in = TRUE; 

30 /* retrieve region if not yet present */ 
if (!phr->rgn ) { 

if ( .phr->region_fn) {err=l; goto error;} 

if (I(phr->rgn = QSAR_REGION_RETRIEVE ( phr->regxon_f n ) 

)) 

35 {err=4;goto error;} 

if fDhr->rqn->n boxes > 1 ) { 

sprintfT str, "WARNING: Region %s has %d boxes. 

Only first 

will be used. \n" , 
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• fn r>hr->rcin->n boxes ) ; 
phr->region_rn, pnr _ 

UBS_OUTPUT_MESSAGE( stdout, str ); 



phr->nbits = 2 * phr->rgn->n_points; 
^ ^ ^. ^,,n- f^irst the DISCO call */ 

) {err=12; 

goto error;} ..Cell_Support" and the 

10 /* go store both the bitsex; xn 

number of bits ^^^o-t-hina for the user to 

actually set in the "CELL", so there's something 

( .TBL_ACCESS_X_PUT_VALUE(table„a»e, row. coinage, 
15 ..CELL_SUPPORT», *).prl„t) ) 

'°'\rr-- -CESS X ™.UE(ta.le„a.e. rov. ^^U,.... 

^ - " (int *)&nvalid) ) {err-ll, 

20 goto 

error ; } 

return TRUE; 

""sprintf (Str, ..QSAR_PROC_EVAL_PHOPE_I^C (*a,", err,; 
25 UTL_ERROR_ADD_TRACE (str); 

return FALSE; 

set_ptr qsar_proc_calc_phore_set( mol, phr, nvalid )
/* creates actual bitset */
mol_ptr mol;
PPHORE phr;
int *nvalid;
{
    set_ptr anset = NIL, pset = NIL, SYB_FEAT_FIND_ID_SET();
    feat_ptr featp, SYB_FEAT_FIND_REC();
    atom_ptr a, SYB_ATOM_FIND_REC();
    int err, elem, sitebase, ci, xybase, boff, lt_base[3],
        lt_off[3], loff = 0, hioff = 0;
    fpt tmp;
    BoxPtr bxptr;

    if ( !(anset = UTL_SET_CREATE( phr->nbits )) ) {err = 1; goto error;}
    *nvalid = 0;
    if (phr->nfuzz) {
        loff -= phr->nfuzz / 2;
        hioff += (phr->nfuzz + 1) / 2;
    }
    bxptr = phr->rgn->box_array;
    xybase = bxptr->nstep[0] * bxptr->nstep[1];

    /* generate the DISCO sites for this molecule, which */
    /* become "FEATURES", "dummy atoms" within SYBYL's molecule data */
    /* structure */
    UIMS2_EXEC_COMMAND( "ECHO %DISCO_SITES()" );
    pset = SYB_FEAT_FIND_ID_SET(mol, FEAT_V_LINE, 1, mol->nfeats);
    if (pset) {
        while ((elem = UTL_SET_NEXT(pset, elem)) != NO_MORE_ELEM) {
            if (!(featp = SYB_FEAT_FIND_REC(mol, elem))) goto error;
            if ((featp->name[1] == 'S') && (featp->name[2] == '_')) {
                /* have an H-bonding feature, it must represent a line */
                sitebase = featp->name[0] == 'A' ? 0 : phr->rgn->n_points;
                /* the dummy atom at the end of the line is our H-bonding locus */
                cdp = (line_ptr) featp->dataptr;
                if (!(a = SYB_ATOM_FIND_REC(mol, cdp->positn)))
                    {err = 2; goto error;}
                for (ci = 0; ci < 3; ci++) {
                    tmp = (a->xyz[ci] - bxptr->lo[ci]) / bxptr->stepsize[ci];
                    lt_base[ci] = (int) (tmp < 0.0 ? tmp - bxptr->stepsize[ci] :
                                                     tmp);
                }
                /* cycle through all points touched by this locus that are also
                   within the ... */
                for (lt_off[0] = lt_base[0] + loff; ... lt_base[0] + hioff; ...)
                 for (lt_off[1] = lt_base[1] + ... ; ... lt_base[1] + ... ; ...)
                  for (lt_off[2] = lt_base[2] + ... ; ... lt_base[2] + ... ; ...) {
                      boff = xybase * lt_off[2] +
                             (bxptr->nstep[0]) * lt_off[1] +
                             lt_off[0] + sitebase;
                      UTL_SET_INSERT( anset, boff );
                      (*nvalid)++;
                  }
            }
        }
        UTL_SET_DESTROY( pset );
    } /* pset exists */
    return ( anset );

error:
    sprintf(str, "qsar_proc_calc_phore_set (%d)", err);
    UTL_ERROR_ADD_TRACE(str);
    return FALSE;
}
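
The bit addressing used by the routine above can be illustrated in isolation. The following is a minimal sketch, not the original SYBYL code: the `Box` type, the function names, and the assumption that the number of lattice points equals the product of the three step counts are all hypothetical stand-ins. It reproduces the offset formula `xybase * z + nstep[0] * y + x + sitebase` and the fuzz offsets derived from `nfuzz`, and adds an explicit bounds check of its own.

```c
#include <assert.h>

/* Hypothetical stand-in for the region box; only the fields the listing uses. */
typedef struct {
    int    nstep[3];     /* lattice points along x, y, z */
    double lo[3];        /* lower corner of the box */
    double stepsize[3];  /* lattice spacing along each axis */
} Box;

/* Lattice index of a coordinate along one axis, rounded toward minus infinity. */
int lattice_index(const Box *box, int axis, double coord)
{
    double q = (coord - box->lo[axis]) / box->stepsize[axis];
    int i = (int) q;            /* the cast truncates toward zero */
    if (q < (double) i) i--;    /* correct negative non-integers downward */
    return i;
}

/* Bit offset of lattice point (ix, iy, iz) for site class 0 (acceptor)
   or 1 (donor). Returns -1 when the point lies outside the box.
   Assumes n_points is the product of the three step counts. */
int lattice_bit(const Box *box, int ix, int iy, int iz, int site_class)
{
    int n_points = box->nstep[0] * box->nstep[1] * box->nstep[2];
    assert(site_class == 0 || site_class == 1);
    if (ix < 0 || ix >= box->nstep[0] ||
        iy < 0 || iy >= box->nstep[1] ||
        iz < 0 || iz >= box->nstep[2])
        return -1;
    return box->nstep[0] * box->nstep[1] * iz   /* xybase * z */
         + box->nstep[0] * iy
         + ix
         + site_class * n_points;               /* sitebase */
}

/* Offsets applied around the base index for a given fuzz width,
   computed the same way as loff and hioff in the listing. */
void fuzz_range(int nfuzz, int *loff, int *hioff)
{
    *loff = -(nfuzz / 2);
    *hioff = (nfuzz + 1) / 2;
}
```

With a 4 x 3 x 2 box, point (1, 2, 1) maps to bit 21 in the acceptor half of the bitset and to bit 45 in the donor half, so the bitset needs 2 x 24 = 48 bits, matching `nbits = 2 * n_points` in the listing.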

# This file determines the recognition of site points in Sybyl/DISCO.
# See the SYBYL DISCO manual for detailed documentation. The defined types
# are
#
# (1) HB : the QUERY is searched in the SEARCH mode, and all occurrences
#     are assigned DISCO features according to the ...
#     ... three ATOMS refer to the atoms in QUERY such that the feature
#     is DIST from the first atom, at bond ANGLE with the first and second
#     atom, at each ... TORSIONS formed by the site point and the three
#     ATOMS in order. ... these extension points, ... the first atom is
#     assigned a feature complementary to the extension point (such as
#     HBD_CO_ and RHBD_CO_).
# (2) HBex: differs from HB in that the angles and torsions are ...
#     ... two other arguments: whether lone pairs are part of the
#     extension point placement, and which ATYPE (generally LP ...)
#     ... sitepoints.

#TYPE NAME ATOMS SEARCH DIST ANGLE TORSIONS QUERY

NoDup^^^T.r' 120 "0.0 180.0" 



# 
A 
# 
P 
# 
c 

15 # 







HB 


DS_02C2_ 


HevC(Any)=0[f ] 


HB 


DS 03Car_ 


30 HB 


DS__03Car_ 


HB 


DS_03Car__ 


HB 


DS_03Car_ 


HB 


DS__03Car__ 


HB 


DS_03C3__ : 



0[f]HC(Any)(Any)C(AnyM^ny,«x., ^^^^ 
HB DS_N3C3_ 1 4 5 NoDup 2.9 

r'rrnTrrr. .0 ..0.0.0.. a„.s,.oh=o,.„ 

„VPE name" atoms SEAPCH DIST LP ATVPE Quary 
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H" 



17:: OS 03C. 2 1 3 NODUP 2.9 VES "LP H" 

0C£lHC,*;y) (Any)3(Z:HevMC(AnyHAny,Any)^^ ,„„Z:CMC.Het, 
HBex DS.03C3_ 3 1 2 NoDup ''^^„„ , 

r^c N7n 2 14 Nodup 2.9 
"? HBex DS N3CJ_ ^ 

„tflH2VaZ(Z = HevSicKVa = «.c.o.,c:Hev> 

, o 1 3 NoDup 2.y 

HBex DS_N3C3_ 2 1 ^ 
KtflH(Ya)Ya{Ya:C.lC=0.1C:Hev} 

HBexDS_N3C3_ 3 1 2 NoDup 2.9 YES 
10 N[f3(Va)(Ya)Ya^Ya:C.lC=0.1C:Hev} ^^..^^.^h^C 
HBex DS_N2C2_ 2 1 3 NoDup 3.0 YES 
12 3 NoDup 3 . 0 
12 3 NoDup 3 . 0 



4 2 1 



HBex DS_N2C2_ 
HBex DS_N2C2_ 
HBex DS_N2N2_ 
15 HBex DS_N2N2_ 
HBex DS_N2N2_ 
hb DS_03S_ 
hb DS_03S 
HevS(=O[f])=0[fl 
20 hb DS_03S_ 4 2 

HevS(-0[f])(~OCfl)-OCfl 
hb DS_03N_ 3 2 

HevN(0[f l)0[f ] 
hb DS_02N_ 
25 HevN(Hev) ~0[f ] 
hbex DS_N2N2_ 
hb DS_03P_ 



YES "H LP" Any~N[f]-c 

vFS "LP" Any~N[r]=C[r] 

. 2 3 NoDup o.o YES r 11 H • C • C N [ f ] : C : @1 

2 1 3 NoTrxv 3.0 YES Lf . . „ . ^ . c -Nf f ] : C : @1 

2 1 3 NoTriv 3.0 YES "LP H" N[1]H.C.C N[ 

, 0 YES "LP" C:N[f]:Hev 

3 2 1NODUP 3.0 HevS=om 

3 2 1 NoDup 2.9 l-io 



All 2.9 



All 2.9 



4 All 2.9 
4 2 1 NoDup 



128 "0.0 180.0" 
128 "0.0 180.0" 
128 "0.0 180.0" 



2.9 



128 "0.0 180.0" 



3 2 1 NoDup 3.0 YES "LP" 



N:N[f ] :N 



1 2 All 



2.9 



128 "0-0 180.0" 



P(-0) (-0) (-0) 



30 



p,-O,(-0) (-0,(-O) "0.0 IBO.O.. 

hb DS_03P_ 3 1 -2 AX X ^^^^ 

# #CLASSNAMES# Acceptor_site Donor_^ ^ ^^^^^^^ o[f ]HC(:Hev) :Hev 
HB AS_H03C2_ 13 4 All 2. 9 1 - 3^3,,,. 

HB AS_H03C3_ 1 3 6 NoDup 2.9 117 
Otf]HC(Any)(Any)C(Any)(Any)Any 

HB AS_N3C3_ 1 4 7 NoDup 2.9 HO 
35 NCf3H2C(Any)(Any)C(Any)(Any)Any ^^^^^ 

HB AS_N3C3_ 1 5 8 NoDup 2.9 HO 
Ntf]H3C(Any)(Any)C(Any)(Any)Any 

#TYPE NAME ATOMS SEARCH DIST LP ATYPE QUERY



HBex AS_HN2C2 
HBex AS_HN2C2 
HBex AS_HN2C2 
HBex AS_H03C3 



10 



2 13 NoDup 
3 2 1 NoDup 
6 5 4 NoTriv 

2 13 NoDup
3.0 "" "H" NHC(Any)=O[f]
3.0 YES "LP H" C:N[f]H:Hev 
3.0 YES "LP" N[l]H:C:C:N[f ] :C:@1 

2.9 YES "LP H" 



OCf,H0,Iny,(.;y).<Z=Hev.>c(*ny,,.ny).ny> 
HBa>t AS HN2C2 3 2 4 Nodup 3.0 1!ES n 
IZl As"'hN2C2'' 1 2 3 Nodup 3.0 VES "LP" HevNtf,=C 
HBexASHN2C2 2 1 4 Nodup 3.0 H" N[f]H2C(N) 

HBexAS:N3C3_ 2 1 4 Nodup 2.9 YES "LP H" 
N[f]H2C(Any) (Any)Z{Z:Hev&lC(Any) (Any) Any} 
HBex AS N3C3 2 1 5 Nodup 2.9 YES "LP H" 
N[f]H3C(Any) ( Any ) Z{ Z : Hev& ! C ( Any) (Any) Any} 



2.9 



15 



2.9 



2.9 



2.9 



20 



25 



30 



AO tiir-\ 2 13 NoDup 

HBex AS_N3C3_ ^ 

N[f]H(Ya)Ya{Ya:C&lC=0&lC:Hev} 

AO Kiri 2 14 NoDup 
HBex AS_N3C3_ z j- •* 

N[f]H2(Ya)Ya{Ya:C&!C=0&lC:Hev} 

HBex AS_N3C3_ 2 1 3 NoDup 
N[f]H(Ya)(Ya)Ya{Ya:C&lC=0&!C:Hev} 

AO xnrT 3 12 NoDup 

HBex AS_N3C3_ J ^ 

N[f ] (Ya) (Ya)Ya{Ya:C&!C=0&!C:Hev} 
HBex AS_HN2C2_ 2 1 3 NoDup 3.0 

3 12 NoDup 
2 14 NoDup 
2 13 NoDup 
12 3 NoDup 
6 5 2 NoDup 
2 13 
N[f](Z) (Z) (Z)Z{Z:C&lC=0&lC:Hev} 
hbexASHN2N2_ 3 2 1 NoDup 3.0 YES "LP" 
hb AS_03P_ 3 1 2 All 2.9 128 

P(~0) (-0) (-0) (~o) 

^o-o T 1 P All 2 9 128 "0.0 180.0" 

hb AS 03P_ 3 12 All 



YES "LP H" 



YES "LP H' 



YES "LP H" 



YES "LP" 



HBex AS_HN2C2_ 
HBex AS_HN2C2_ 
HBex AS_HN2C2_ 
HBex AS_HN2C2_ 
HBex AS_HNS3_ 
HBex AS HN4 



3.0 
3.0 
3.0 
3.0 
3.0 
NoDup 



YES "H LP" N[f]H=C 
YES "LP" N[f]=c-Any 
... ..H" N[f ]H2Hev(:Hev) :Hev 
N[f ]HHev( :Hev) :Hev 

HNC=Any 
AnyS(=0) (=o)N[f]H 

g III. "C*" 



I. II 



"H" 
"H" 
"H" 



-3 



N : N [ f ] : N 
"0.0 180.0" 

P(~0) (~0) (-0)

APPENDIX "C"

EXPERIMENTAL DATA SETS

     Data Set        No. of Cpds   Structure, Activity

 1   Uehling              9        camptothecin, DNA fragmentation
 2   Strupczewski        34        benzisoxazoles, ip Behavioral
 3   Siddiqi             10        adenosines, Brain A1 binding
 4   Garratt1            10        tryptamines, melanophore binding
 5   Garratt2            14        tryptamines, melanophore binding
 6   Heyl                11        deltorphin, opioid receptor (DAMGO)
 7   Cristalli           32        adenosines, A2a agonists
 8   Stevenson            5        piperidines, NK1 antagonism
 9   Doherty              6        triarylbutenolides, endothelin-A antag.
10   Penning             13        SC-41930 analogs, LTB4 antagonism
11   Lewis                7        oxazolinediones, NK1 binding
12   Krystek             30        sulfonamides, endothelin-A antagonism
13   Yokoyama1           13        oxamic acids, T3 binding
14   Yokoyama2           12        oxamic acids, T3 binding
15   Svensson            13        benzindoles, 5-HT1A agonism
16   Tsutsumi            13        peptidyl heterocycles, endopeptidase inhib.
17   Chang               34        biphenyl sulfonamides, AT1 binding
18   Rosowsky            10        trimetrexate analogs, DHFR inhibition
19   Thompson             8        peptidomimetic, HIV-1 protease inhibition
20   Depreux             26        naphthylethyl amides, melatonin displ.




Literature References for Data Sets:

1.  Uehling, D.E., Nanthakumar, S.S., Croom, D., Emerson, D.L., Leitner, P.P.,
    Luzzio, M.J., et al., Synthesis, Topoisomerase I Inhibitory Activity and in Vivo
    Evaluation of 11-Azacamptothecin Analogs. J. Med. Chem. 1995, 38, 1106. (Table
    2, with R2=Et; IC50 data.)

2.  Strupczewski, J.T., Bordeau, K.J., Chiang, Y., Glamkowski, E.J., Conway, P.G.,
    et al., 3-[[(Aryloxy)alkyl]piperidinyl]-1,2-Benzisoxazoles as D2/5-HT2 Antagonists
    with Potential Atypical Antipsychotic Activity: Antipsychotic Profile of Iloperidone
    (HP 873). J. Med. Chem. 1995, 38, 1119. (Tables 2 and 3 with n=3, X=O; ED50.)

3.  Siddiqi, S.M., ... Search for New Purine- and Ribose-Modified Adenosine Analogs
    as Selective Agonists and Antagonists at Adenosine Receptors. J. Med. Chem. ...
    (Table 1, R1=H; Ki(A1), values estimated from % displacement and stereoisomers
    averaged as needed.)

4.  Garratt, P.J., Jones, R., ... Receptor. 3. Design and Synthesis of ... Agonists
    and Antagonists Derived ...

5.  Garratt, P.J., ... Receptor. 3. Design and Synthesis of ... Agonists and
    Antagonists Derived ... (... Table 2).

6.  Heyl, D.L., Dandabathula, ... Binding Requirements for the δ-Selective Peptide
    Deltorphin I: Phe3 Replacement with Ring-substituted and Heterocyclic Amino
    Acids. J. Med. Chem. ..., 1242. (Table 1; binding Ki to DAMGO.)

7.  Cristalli, G., Camaioni, E., Vittori, S., ... et al., 2-Aralkynyl and
    2-Heteroalkynyl Derivatives of Adenosine-5'-N-ethyluronamide as ... A2a
    Adenosine Receptor ...

8.  Stevenson, G.I., MacLeod, A.M., Huscroft, ... Baker, R., 4,4-Disubstituted
    Piperidines: A New Class ... J. Med. Chem. 1995, 38, ...

9.  Doherty, A.M., Patt, W.C., Edmunds, ... Berryman, K.A., Reisdorph, B.R., et
    al., Discovery of a Novel Series of Orally Active ... Receptor-Selective
    Antagonists. J. Med. Chem. 1995, 38, ... (Table 3; IC50 ...)

10. Penning, T.D., Djuric, S.W., Miyashiro, J.M., Yu, S., ... Heterocyclic
    Replacement of the Methyl ...

11. Lewis, ... J. Med. Chem. 1995, 38, 923.

12. Krystek, ... Endothelin Inhibitors. J. Med. Chem. 1995, 38, 659.

13. Yokoyama, N., Walker, G.N., Main, A.J., Stanton, J.L., Morrissey, M.M., et al.,
    Synthesis and SAR of Oxamic Acid and Acetic Acid Derivatives Related to
    L-Thyronine. J. Med. Chem. 1995, 38, 695.

14. Yokoyama, N., Walker, G.N., Main, A.J., Stanton, J.L., Morrissey, M.M., et al.,
    Synthesis and SAR of Oxamic Acid and Acetic Acid Derivatives Related to
    L-Thyronine. J. Med. Chem. 1995, 38, 695.

15. Haadsma-Svensson, S.R., Svensson, ... Lin, C.-H., C-9 and N-Substituted
    Analogs of cis-(3aR)-(-)-2,3,3a,4,5,9b-Hexahydro-3-propyl-
    1H-benz[e]indole-9-carboxamide: 5-HT1A Receptor ... Degrees
    of Metabolic Stability. J. Med. Chem. 1995, 38, 725.

16. Tsutsumi, S., Okonogi, T., Shibahara, S., Ohuchi, S., Hatsushiba, E., et al.,
    Synthesis and Structure-Activity Relationships of Peptidyl α-Keto Heterocycles as
    Novel Inhibitors of Prolyl Endopeptidase. J. Med. Chem. 1994, 37, 3492. (Table 2,
    X=CH2CH2; IC50.)

17. Chang, L.L., Ashton, W.T., Flanagan, K.L., Chen, T.-B., O'Malley, S.S., et
    al., ... Receptor Antagonists with High Affinity for Both the AT1 and AT2
    Subtypes. J. Med. Chem. ... (Table 1, R' = (2-Cl)C6H5; AT1 [rabbit aorta] IC50.)

18. Rosowsky, A., Mota, C.E., Wright, J.E., Queener, S.F., 2,4-Diamino-5-
    chloroquinazoline Analogs of Trimetrexate and Piritrexim: Synthesis and Antifolate
    Activity. J. Med. Chem. 1994, 37, 4522. (Table 2; rat liver IC50.)

19. Thompson, ... Zhao, B., Winborne, E., Green, D.W., et al., ... -Based HIV-1
    Protease Inhibitor Containing a Heterocyclic P1'-P2' Amide Isostere. J. Med.
    Chem. 1994, 37, 3100. (Table 2, X=Boc; apparent Ki.)

20. Depreux, P., Lesieur, D., Mansour, H.A., Morgan, P., et al., Synthesis and
    Structure-Activity Relationships of Novel Naphthalenic ... Amidic Derivatives
    as Melatonin Receptor Ligands. J. Med. Chem. 1994, 37, ...



APPENDIX "D"

... The 231 clusters are sorted by proposed name, first by the ... attached
immediately to the -SH, and then by the substitution pattern on that root
... i.e., structures in ...

Cluster  Cluster  Struct.      Structural
         Size     Root         Substitution*


1 

5 144 
177 
163' 
151 
33 

10 80 
192 
7 
27 
107 

15 189 
141 
205 
188 
56 

20 138 
190 
41 
152 
16 

25 85 
106 
77 
142 



26 



aryl Simple 



1 aryl 2,3,5-Me 

1 aryl 2,3,5-Me-4-Pr 

1 aryl 2,3-(4-(2,3-Pr)5het)5hetO 

1 aryl 2,3-(4-Bu)5hetO-5-Me 

5 aryl 2,3-Benzo 

2 aryl 2,5-Me 

1 aryl 2,5-Me-3-iPe 

14 aryl 2,6-NoH-3(4/5)-Me 

6 aryl 2,6-NoH-3-Ar 

2 aryl 2-(2-Bz)PheEt-4,5-Benzo 
1 aryl 2-(3,5-Me)Ar-4,5-Benzo 
1 aryl 2-(4-Et)PhePr 

1 aryl 2-(4-Stilbenyl)Stilbenyl 
1 aryl 2-5hetCH2-4,5-Benzo 



3 aryl 



6 

1 aryl 
9 aryl 



2 aryl 
1 aryl 



2-Ar 



1 aryl 2-Ar-3,5-Me 

1 aryl 2-Ar-4,5-(3,4-Et)Benzo 



aryl 2-Ar-4,5-Benzo 
2-Bz 
2-Et 

2-NoH-3-Et-5-Me 



2 aryl 
2 aryl 2-PheEl-4,5-Benzo 

2-PhePr 

2-R8 



191 



^1 2-SUlbenyl 

121 ^ 3,4-(3-Me)Benzo 

97 ^ 3,4-(a,b)IndenO 

218 ^ 3,4-(a,b,(8-Ar)IndenO)-6-Me 

164 ^ 3,4-(a,b,(c-Me)IndenO) 

5 98 ^ aryl 3,4-(a,b-Naphtho) 

\ aryl 3,4-Ar 

157 3,4-Benzo-5-Me 

58 ^ 3,4-Benzo-6-tBu 

' aryl 3,5-Me 

10 37 3.(2,3-Benzo-4-Et)5het 

180 ^ 3.(2,3-Benzo-5-Me)5het 

199 ^ 3-(2-Me-3-5het-5-Et)5hei 

182 ^ 3-(3-5het)5hei 

115 ^ 3-(3-Ar)5het-4-Me 

15 ' 3-Ar 

67 ^ 3-Ar-4-(2-Me)5hetCH2 



25 88 



2 



129 ^ 3-Ar-5-Me 

aryl 3-Bz 



1 



155 3-Bz-5,6-Benzo 



2 



20 82 3-Me 

10 3.Naphth 

70 , 3.pr-4-sBu-6-Me 

aryl ^ 



3 



aryl 3-iPr 



2 



95 aryl 4-Ar 



2

81     2    aryl         4-Bz
48     4    aryl         4-Et
2      23   aryl         4-Me
92     2    aryl         4-R9+
90     4    aryl         4-iBu
19     8    aryl         6-NoH
148'   1    aryl         (adenosine)
228    1    aryl         (fluorescein)
12     10   5het         Simple
50     4    5het         2,3-(a,b-Naphtho)
139    1    5het         2,3-5hetO-4-Me
89     2    5het         2,3-Ar
173    1    5het         2-(2,5-Et)Ar-3-Et
69     3    5het         2-(2-Me)Ar-3-(2-Me)PheEt
198    1    5het         2-(2-Me)Ar-3-R10
174    1    5het         2-(2-sBu)-3-Et
171    1    5het         2-(3,5-Me)Ar-3-5het
170    1    5het         2-(3,5-Me)Bz-3,4-Benzo
123    2    5het         2-(3-Et)Ar-3-Bz
22     7    5het         2-(4-Et)Ar
202    1    5het         2-(4-Et)Ar-4-(4-Me)Ar
122    2    5het         2-(4-iPr)Ar-3-Bz
197    1    5het         2-5hetCH2-3-(4-tBu)Ar
6      14   5het         2-Ar
225    1    5het         2-Ar-3-(2-Ar)5hetBu

224    1    5het         2-Ar-3-(2-Ar)5hetCH2
63     3    5het         2-Ar-3-(2-Bz)Ar
178    2    5het         2-Ar-3-(2-Me)5het
72     3    5het         2-Ar-3-(3,4-Et)Bz
40     5    5het         2-Ar-3-(3-Ar)5HetEt
183    1    5het         2-Ar-3-(3-Ar)PhePr
64     3    5het         2-Ar-3-(3-Ar-5-Me)5het
105    2    5het         2-Ar-3-(3-Me)Ar
160    1    5het         2-Ar-3-(4-Ar)Cyhx
146    1    5het         2-Ar-3-(4-Ar)CyhxCH2
203    1    5het         2-Ar-3-(4-PheEt)Ar
126    2    5het         2-Ar-3-(tBu)Ar
17     9    5het         2-Ar-3-Ar
211=   1    5het         2-Ar-3-Benzylidene
124    2    5het         2-Ar-3-IndenCH2
28"    6    5het         2-Ar-3-Me
30     6    5het         2-Ar-3-PhePr
204    1    5het         2-Ar-5-(4-(2,4-Me)Bz)Ar
79     2    5het         2-Bz
78     2    5het         2-Bz-3,4-Benzo
117    2    5het         2-Cyhx
186    1    5het         2-Cyhx-3,4-iPe
68     3    5het         2-Et
112    2    5het         2-Et-3-(2-Me)PheEt
128    2    5het         2-Me-3,4-(3-Me)Benzo



93     2    5het         2-Me-3,4-Benzo
61     3    5het         2-Me-3-(2,3,4-Me)5het
181    1    5het         2-Me-3-(2,3-Benzo-4-Et)5het
49     4    5het         2-Me-3-(3-Ar)5het
86     2    5het         2-Me-3-(3-Ar)5hetPr
91     2    5het         2-Me-3-(3-Ar-5-Me)5het
4      17   5het         2-Me-3-(3-Bz)Ar
172    1    5het         2-Me-3-(4-tBu)PheEt
38     5    5het         2-Me-3-5Het
13     10   5het         2-Me-3-Me
222    1    5het         2-Me-3-Pe
66     3    5het         2-Me-3-PheEt
29     6    5het         2-Me-3-PhePr
71     3    5het         2-Me-3-R8+
108    2    5het         2-Me-5-Bu
127    2    5het         2-Pe-3-Ar
54     3    5het         2-Pr
221    1    5het         2-R12
187    1    5het         2-iBu-3,4-iPe
143    1    5het         2-iPe-3,4-Benzo
96     2    5het         3,4-(2,4-Me)Benzo
162    1    5het         3,4-(3-Ar)Benzo
169    1    5het         3,4-(3-Hx)Benzo
94     2    5het         3,4-(3-Pr)Benzo
210    1    5het         3,4-(a,b-Naphtho)





36     15   5het         3,4-Benzo
176    1    5het         3-(2,4-Me)Bz
196    1    5het         3-(3,5-Me)Ar
159    1    5het         3-(3-Ar)5het
42     4    5het         3-(3-Bz)Ar
200    1    5het         3-(3-Me)PheEt
113    2    5het         3-(4-Me)Ar
125    2    5het         3-(4-tBu)Ar
191    1    5het         3-(A1-4-Et)PheEt
145    1    5het         3-(B-Ar)PhePr
114    2    5het         3-5hetCH2
18     8    5het         3-Ar
59     3    5het         3-Ar(2-thia)
65     3    5het         3-Bu
24     7    5het         3-Me-5-H
44     6    5het         3-Me-5-NoH
52     5    5het         3-Pe
111    2    5het         3-PheEt
153    1    5het         3-PhePr
32"    6    5het         3-Pr
223    1    5het         3-R13
185    1    5het         (chrysenO)
34     5    alkyl        Simple
104    2    alkyl        (3)(B1)(B1)
62     3    alkyl        (3-Me)PhePr





3      18   alkyl        (3:4)
14     9    alkyl        (3:4)(A1)
60     3    alkyl        (3:4)(B1)
226    1    alkyl        (4)(A1)(A-tBu)(C1)(C1)
45     4    alkyl        (4)(D1)(D1)
35     7    alkyl        (4-Me)PhePr
168    1    alkyl        (4-iPe)PhePr
47     4    alkyl        (5)(A1)
179    1    alkyl        (5)(B1)(E-(2-Ar-5-Me)5het)
103    2    alkyl        (5)(B3)
76     2    alkyl        (5)(C1)(C1)
83     2    alkyl        (5)(C2)
216    1    alkyl        (5)(C2)(D2)(D2)
43     8    alkyl        (5:6)(D1/B1/F1)
5      15   alkyl        (5:7)
158    1    alkyl        (6)(B8)(C1)(E1)(E1)
140    1    alkyl        (6)(F-Ar)
166    1    alkyl        (7)(A8)(F1)
53     3    alkyl        (7)(D3)(D3)
207    1    alkyl        (8)(C3)
8      13   alkyl        (8:11)
206    1    alkyl        (9)(B4)(G3)
75     3    alkyl        (10)(B1)(E5)(E1)
136    1    alkyl        (10)(C1)(E5)(E2)
20     8    alkyl        (10+)(B1)





39     7    alkyl        (11+)(B1)
154=   1    alkyl        (12)(A-PheEt)
230    1    alkyl        (12)(F6)(F1)
131    2    alkyl        (12)(F6)(F6)
15     9    alkyl        (12+)
137    1    alkyl        (13)(E4)
231    1    alkyl        (A-Ar)(A-Ar)Bz
229    1    alkyl        (A-Bz)(A-Bz)PheEt
184    1    alkyl        (A1)PheEt
227'   1    alkyl        (cholesterol)
214'   1    alkyl        (cryptate)
23     7    alkyl        PheBu
74     3    alkyl        PheEt
25''   6    alkyl        PhePr
11     10   benzyl       Simple
102    2    benzyl       2,4,5-Me
57     3    benzyl       2,4,6-Me
217    2    benzyl       2-(3-(2-Et)Ar)Ar
...    1    benzyl       2-Et-3-(2,3-Et-5-Me)Ar-5-Me
212    1    benzyl       2-R8-3-Naphthyl-4,5-Benzo
9      13   benzyl       2/3-Me
84     2    benzyl       3,4-Benzo
132    2    benzyl       3,5-Me
130    2    benzyl       3-(4-Stilbenyl)Stilbenyl
134    2    benzyl       4-(3-Ar)Ar



21     7    benzyl       4-Et
26"    6    benzyl       4-Me
156    1    benzyl       4-PhePr
201    1    benzyl       4-tBu
135    2    alkenyl      Ar..(2-Et)Ar
220    1    alkenyl      Ar..(4-Bz)Ar
116    2    alkenyl      Ar..Ar
133    2    alkenyl      Ar..Bz
110    2    alkenyl      Et.CN.CONH2
87     2    alkenyl      NH2.CN.N=NPh
119    2    alkenyl      P(NMe2)3..Ar
120    2    alkenyl      P(Pr)3..Ar
118    2    alkenyl      P(iPe)3..Ar
51     4    alkenyl      PCyhx3..Ar
195=   1    alkenyl      PEt3..(2-Bz)Ar
31"    6    alkenyl      PEt3..Ar
194    1    alkenyl      PEt3..Bz
109    2    alkenyl      PheEt.CN.CONH2
101    2    cyclohexyl   Simple
149    1    cyclohexyl   1-Me-2,4-CMe2
55     3    cyclohexyl   2,3,4,5-iBu
147    1    cyclohexyl   2,3,4-iBu-5-iPe
209    1    cyclohexyl   2-(3,4-PheEt)5het-6-Me
208    1    cyclohexyl   2-Me-3,5-CMe2
167    1    cyclohexyl   2-Me-4-sPe
165    1    cyclohexyl   2-iPr-3,5-Me
150    1    cyclohexyl   3-sPe-6-Me
161    1    cyclohexyl   4-Et-4-iBu
219    1    cyclohexyl   (complex)
175    1    cyclopentyl  2-Ar-4-spiro

215    1    cyclopentyl  3-PhePr

*To generate these names, all heteroatoms are first replaced by carbon (to produce the
simplest common topology) and a particular structure is chosen from among these
topologies as the "most typical" of that cluster, if possible one containing the largest
substructure that distinguishes that cluster from all others.

Within the name of a substitution, numbers indicate positions when substitution is on a
ring, but chain length when substitution is on a chain (numbers separated by a colon
indicate a range of chain lengths). Also, within a chain, letters indicate a position of
substitution. (For example, (C2) describes a two atom branching from the third position of
a chain, while 3-PhePr describes a phenyl propyl skeleton attached to the 3-position of a
ring.)

A dot notation (.) separates the three possible substituents on an alkenyl root, the
substituent order being same carbon as the -SH substituent, then the position trans to the
-SH, and finally cis to -SH.
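
The dot notation can be read mechanically. The sketch below is only an illustration of that reading and is not part of the original appendix: the `AlkenylSubst` type and `split_alkenyl_name` function are hypothetical names, and the code assumes that an alkenyl substitution name always contains exactly two dots, with an empty field meaning that no substituent is named at that position.

```c
#include <assert.h>
#include <string.h>

#define SUBST_MAX 64

/* Substituents on an alkenyl root, in the order the dot notation lists them. */
typedef struct {
    char same_carbon[SUBST_MAX];  /* on the carbon bearing the -SH */
    char trans[SUBST_MAX];        /* trans to the -SH */
    char cis[SUBST_MAX];          /* cis to the -SH */
} AlkenylSubst;

/* Copy len characters into a fixed-size field and terminate the string. */
static void copy_field(char *dst, const char *src, size_t len)
{
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split a name such as "PEt3..Ar" into its three fields.
   Returns 1 on success, 0 if the name does not contain exactly two dots
   or if a field is too long for the fixed-size buffers. */
int split_alkenyl_name(const char *name, AlkenylSubst *out)
{
    const char *d1 = strchr(name, '.');
    const char *d2 = d1 ? strchr(d1 + 1, '.') : NULL;
    size_t n1, n2, n3;

    if (!d1 || !d2 || strchr(d2 + 1, '.'))
        return 0;
    n1 = (size_t) (d1 - name);
    n2 = (size_t) (d2 - d1 - 1);
    n3 = strlen(d2 + 1);
    if (n1 >= SUBST_MAX || n2 >= SUBST_MAX || n3 >= SUBST_MAX)
        return 0;
    copy_field(out->same_carbon, name, n1);
    copy_field(out->trans, d1 + 1, n2);
    copy_field(out->cis, d2 + 1, n3);
    return 1;
}
```

For example, "PEt3..Ar" splits into PEt3 on the -SH carbon, nothing trans, and Ar cis, while "NH2.CN.N=NPh" fills all three positions. Names enclosed completely in parentheses are outside the scope of this sketch.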



The above notwithstanding, any name enclosed completely in parentheses takes its usual
structural meaning.

Here are structural descriptions for each name abbreviation in the above table, mostly in
SLN (SYBYL Line Notation), listed alphabetically. (SLN extends SMILES with the
following concepts, among others. Hydrogens are explicit. Ring openings and closures
begin with a number enclosed by [] and end with the matching number preceded by @.
Other SLN symbols used in these SLN definitions are: ~ = any bond; - = single bond
(used here to provide a reference for [R]); : = aromatic bond; ! = the SLN following (here
in parentheses) is not allowed; [F] = no additional atoms may be attached to the preceding
atom; [!R] = preceding bond may not be in a ring; [R] = preceding bond must be in a
ring.)

5het = 5Het = C[1]:C:C:C:C:@1.
alkenyl = C=C.
alkyl = C-[!R]C.
aryl = Ar = Phe = Ph = C[1]:C:C:C:C:C@1.
benzyl = Bz = HSC-[!R]C~[R]C.
Bu = C-[!R]C-[!R]C-[!R]C-[!R]C.
cyclohexyl = Cyhx = C[1](- | =)C-C~C-C~C~@1.
cyclopentyl = C[1]~(- | =)C~C~C-C~@1.
Et = C-[!R]C.
inden = C[1]:C(~C~X~[2]):C(~@2):C:C@1.
iBu = C-[!R]C-[!R]C(-[!R]C)-[!R]C.
iPe = C-[!R]C-[!R]C-[!R]C(-[!R]C)-[!R]C.
Me = C.
naphth = C[1]:C(~C~X~[2]):C(-@2):C:C:C@1.
NoH = !(CH).
O denotes ring fusion, e.g., benzo fuses a 6-membered aromatic ring.
Pe = C-[!R]C-[!R]C-[!R]C-[!R]C-[!R]C.
Pr = C-[!R]C-[!R]C-[!R]C.
R# = alkyl chain of approximate length #.
Simple = !(C-[!R]C).
sPe = C(-[!R]C)-[!R]C-[!R]C-[!R]C-[!R]C.
Stilbenyl = C=[!R]C-[!R]C[1]:C:C:C:C:C@1.
tBu = C(-[!R]C)(-[!R]C)-[!R]C.

APPENDIX "E"



The following replaces section E contained in the priority applications. Not all of what was
previously in E is included here, because the latest versions of BUILD_3D etc. are
provided separately in Section A.

The first phase of construction of a combinatorial library takes
as input a description of the chemical transformation represented
by that combinatorial library and a list of available reagents such
as the Available Chemicals Directory (ACD), and produces as output
all the part structures (aka substructures or fragments), in product
form, found in the list of available reagents which are appropriate for
the chemical transformation, along with all structure-invariant
physicochemical properties of those fragments that might be useful in
diversity design (OptiVerse) or searching (VL).

In the course of this process, data are recorded permanently into three
tables:

REACTIONS (a Molecular Spreadsheet) = information about a
    reaction scheme. Each record corresponds to a reaction,
    where PanLabs or the manager of the VL designates
    what is a reaction. A typical reaction would be:
    "reaction of each nitrogen of a diamine with various
    reagents such as acids (acylation) or ketones (reductive
    amination)".

REAGENTS (a Molecular Spreadsheet) = information about a
    particular set of reagents used in some instance of a
    reaction. Each record corresponds to a particular logical
    reagent structure search in a database such as the Available
    Chemicals Directory, presumably a set of reagent structures
    which will all react in the same way. For example, there are
    sixteen reagent records for the diamine reaction, enumerating
    each of eight reactant classes that might react with
    each of the two nitrogens. One record, for example, describes a
    reaction with epoxides, which could be ring opened nucleophilically
    (and regioselectively) by an amine to yield a beta-amino alcohol.

RDATA (an Oracle Table) = invariant physicochemical data computed about
    compound fragments, typically the varying portions in a
    cSLN, with one record for each fragment encountered in ANY
    cSLN constructed. Thus data need not be recomputed when such
    fragments are reencountered, a substantial savings in processing time.
    For example, records will be added describing the properties of
    a -CH2CH(OH)R chain (product fragment) for each (new) epoxide-R
    reagent retrieved by the example record just given for the REAGENTS
    spreadsheet.
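
The role RDATA plays, computing fragment properties once and reusing them whenever the same fragment recurs, can be sketched as a small lookup-or-compute table. This is an illustration only and does not describe the Oracle schema: the `Rdata` and `RdataRow` types, `rdata_get`, and the toy property function `heavy_atom_letters` are hypothetical, and a single numeric property stands in for the full set of physicochemical data.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define RDATA_MAX 128
#define SLN_MAX   128

/* One cached record: a fragment SLN and one property computed for it. */
typedef struct {
    char   sln[SLN_MAX];
    double value;
} RdataRow;

/* The cache itself, plus a counter of how often a property was computed. */
typedef struct {
    RdataRow rows[RDATA_MAX];
    int      n;
    int      computed;
} Rdata;

typedef double (*PropertyFn)(const char *sln);

/* Return the property for a fragment, computing it only on first encounter.
   Returns 1 on success, 0 if the table is full or the SLN is too long. */
int rdata_get(Rdata *t, const char *sln, PropertyFn compute, double *value)
{
    int i;
    for (i = 0; i < t->n; i++) {
        if (strcmp(t->rows[i].sln, sln) == 0) {
            *value = t->rows[i].value;     /* reuse the stored record */
            return 1;
        }
    }
    if (t->n >= RDATA_MAX || strlen(sln) >= SLN_MAX)
        return 0;
    strcpy(t->rows[t->n].sln, sln);
    t->rows[t->n].value = compute(sln);    /* first encounter: compute once */
    t->computed++;
    *value = t->rows[t->n].value;
    t->n++;
    return 1;
}

/* Toy placeholder property: counts upper-case letters other than H. */
double heavy_atom_letters(const char *sln)
{
    double n = 0.0;
    for (; *sln; sln++)
        if (isupper((unsigned char) *sln) && *sln != 'H')
            n += 1.0;
    return n;
}
```

Requesting the same fragment twice leaves `computed` at 1, which is the saving the RDATA description refers to. A linear scan is used here only for brevity.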

Entering a new reaction into the system involves adding a new row to
REACTIONS and at least two new rows to REAGENTS, by hand. This data
entry operation is the only required data entry in preparation for virtual
library production.

All other operations on these entities are carried out by the SPL script
getacd.core, executed within SYBYL. This script is reproduced below in its
entirety.

The major overall output of getacd.core is a set of files for a reaction,
whose base (file set) names are constructed by concatenating record numbers
from the REACTIONS and REAGENTS tables, and whose extensions are as follows:

.files = explicitly contains the names for all other files.

.csln = the template or prototype for the construction of a particular
cSLN. If there is more than one possible core for a particular reaction, their
structures and properties are recorded in the optional .cores file.

.X1, .X2, ... = a "hitlist" having an SLN with property contributions
for each unique fragment or variation at a particular position. Each
variation site has its own hitlist file.

.cores = similar to an .Xn file, but describes available variations
in a cSLN core. For example, the .cores file for the diamine reaction lists
SLNs of the cores and properties that each of the commercially available
diamines would contribute.

Two intermediate data tables are used in some of the operations of
getacd.core, as molecular spreadsheets:

HITS = results of a particular reagent search; also records
    information about supplier, catalog number, and price.
RSCRATCH = a "work table" used for calculation of side chain
    properties.

To aid in understanding the getacd.core SPL script which follows, here are
descriptions of the individual "columns" (aka attributes, fields) for each
of the tables introduced above.

REACTIONS:

NAME (text) For user recognition only.
CLASS_ID (integer) A "global" identifier for a particular reaction scheme.
VARIATION (integer) Can be more than one per CLASS_ID; intended to distinguish
    among different reaction conditions for a particular reaction. This
    value is the key linking REACTIONS and REAGENTS.
NREAG (integer) Number of rows in REAGENTS for this reaction. Used only for
    checking self-consistency of user input.
CORE_SLN (text) The SLN of the core for this reaction, along with information
    needed by the cSLN builder to correctly attach side chains, or, especially,
    to correctly merge polyvalent variations with an invariant core.

Example of a record (diamine reaction, producing the R5V2Rn fileset),
broken into two lines for clarity:

            1            2          3           4
            NAME         CLASS_ID   VARIATION   NREAG
5  Row5     Piperazine   5          2           16

            5
            CORE_SLN
5  Row5     N[1](X1)CH2CH2N(X2)CH2CH2@1 2,X1R1 = 1;10,X2R1 = 9
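
The self-consistency check that NREAG supports can be sketched as follows. This is a hedged illustration rather than code from getacd.core: the `Reaction` and `Reagent` structs and both function names are hypothetical, only the columns needed for the check are modeled, and REAGENTS rows are assumed to link to a reaction through VARIATION alone, as the column description states.

```c
#include <assert.h>

/* Hypothetical in-memory form of a REACTIONS row. */
typedef struct {
    const char *name;
    int         class_id;
    int         variation;   /* key linking REACTIONS and REAGENTS */
    int         nreag;       /* number of REAGENTS rows the user declared */
    const char *core_sln;
} Reaction;

/* Hypothetical in-memory form of a REAGENTS row (other columns omitted). */
typedef struct {
    int id;
    int variation;           /* many-to-one link to a Reaction */
} Reagent;

/* Count the REAGENTS rows that belong to a given VARIATION. */
int count_reagents(const Reagent *rows, int nrows, int variation)
{
    int i, n = 0;
    for (i = 0; i < nrows; i++)
        if (rows[i].variation == variation)
            n++;
    return n;
}

/* Returns 1 when the declared NREAG matches the rows actually present. */
int nreag_consistent(const Reaction *rx, const Reagent *rows, int nrows)
{
    return count_reagents(rows, nrows, rx->variation) == rx->nreag;
}
```

For the piperazine record above, the check would pass only if exactly sixteen REAGENTS rows carry VARIATION 2.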



REAGENTS:

ID (integer) invariant identifier for this record.
VARIATION (integer) link to REACTIONS by many-to-one relation.
SEARCH_SLN (text) SLN for the reactive fragment, which any reactant
    molecule (e.g., within ACD) must contain in order to undergo
    the particular reaction.
NOTLIST (text) combination of SLNs and files (of additional SLNs)
    for fragments that must NOT be contained within any reactant
    to be used in this reaction. (Reasons include interference
    with this or other reactions in the sequence, or toxicity
    to biological systems.)
PRUNE_SLN (text) similar and usually identical to SEARCH_SLN but
    may not contain any atoms or bonds of type "Any"; needed
    while processing the individual reagent to overcome some
    quirks in SLN processing within SYBYL.
SAME_AS (text) a hitlist file name. If present, this file's contents
    are used instead of an explicit reagent search, which then need not
    be done. (For example, the list of acids that react with
    piperazine is identical for each of the two nitrogens.)
HOW (text) a series of structural modification commands which the
    script uses to convert a reactant structure into the corresponding
    atoms within the product. Atom ID references within these
    commands are sequence numbers of that atom within the PRUNE_SLN,
    or names of atoms generated in a previous command.

Example: An isocyanate (PRUNE_SLN is CN=C=O) becomes most of a urea 
(CNHC(=O)X1) when reacted with an amine. Here is the HOW for this 
transformation: 

BREAKB,2,3 ATYPE,2,N.am ATYPE,3,C.2 FILLV,2,H,A1 FILLV,3,H,A2 MARKX,A2 

(reading left to right: the N=C becomes single; the N is made trivalent; 
the C is made sp2; a hydrogen named A1 is added to the N; a hydrogen named A2 
is added to the C; the A2 is marked as designating a "free valence" whenever 
a cSLN is expanded.) 
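
The structure of a HOW entry can be illustrated with a short sketch. It is written in Python rather than SPL purely for illustration, and the function name parse_how is hypothetical. It only splits the space-separated commands and their comma-separated arguments as they appear in the isocyanate example; it does not perform any structural modification.

```python
def parse_how(how):
    """Split a HOW string into a list of (command, [arguments]) pairs.

    Commands are separated by blanks; the command name and its arguments
    are separated by commas, as in the isocyanate example.
    """
    steps = []
    for token in how.split():
        fields = token.split(",")
        steps.append((fields[0].upper(), fields[1:]))
    return steps


HOW = "BREAKB,2,3 ATYPE,2,N.am ATYPE,3,C.2 FILLV,2,H,A1 FILLV,3,H,A2 MARKX,A2"

for command, args in parse_how(HOW):
    # Numeric arguments are atom sequence numbers within PRUNE_SLN;
    # names such as A1 and A2 refer to atoms created by an earlier command.
    print(command, args)
```

A dispatcher built on such a list would handle one command at a time, in the same left-to-right order as the reading given above.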

ATTACHED (text) the file extension for the output file of cSLN variations 
that this record produces. 

TEMPLATE (text) for polyvalent variations only, information needed to 
build an aligned topomeric conformation, as follows: 

Argument 1: a file containing a pre-aligned 3D structure. 
Argument 2: the SLN of the template within the 3D structure produced 
by joining the reactant molecule to the pre-aligned structure. 
Argument 3: the name of an SPL macro that performs any additional 
structural operations needed to generate the topomeric conformation. 
Argument 4: any additional arguments to be passed to the macro named 
in argument 3. 

Example: aram.mol2 NH=CHCH2C(:Any):CH ACD!FIX_FUSE 10,11 

VALENCES (integer) the number of valences within each of these variations. 
FGPT_XTRA (text) for optimal fingerprint estimation, the SLN for any atoms 
that this particular record will ALWAYS add to the core. For example, FGPT_XTRA 
for the isocyanate acylation example in HOW is: C(=O)NHC 



EXAMPLE: Here is the record for the reductive amination reaction in which 
a carbonyl (aldehyde or ketone only) is condensed with a primary or secondary 
amine and then reduced to the amine with borohydride. 
             1    2          3 
             ID   VARIATION  SEARCH_SLN 
13 ROW13     13   2          HcC(=O)C(-:Any)-:Any{Hc:H|C(-:Any)} 

             4 
             NOTLIST 
13 ROW13     badls.kal O=CO[f] O=COH O=CC=O O=CAnyC=O \ 
             HcC(=O)C(-:Any)(-:Any).HcC(=O)C(-:Any)(-:Any){Hc:H|C(-:Any)-:Any} \ 
             NH(Hc)C(Any)Any{Hc:H|C(Any)Any} 

             5                  6 
             PRUNE_SLN          SAME_AS 
13 ROW13     HcC(=O)C{Hc:H|C} 

             7 
             HOW 
13 ROW13     BREAKB,2,3 DELA,3 ATYPE,2,C.3 FILLV,2,H,A1 MARKX,A1 

             8          9          10 
             ATTACHED   TEMPLATE   VALENCES 
13 ROW13     X1 

             11 
             FGPT_XTRA 
13 ROW13     CHC 



RDATA 

CRC (NUMBER(10,0), primary key) a "cyclic redundancy code", used most 
often to verify the integrity of data communication packets, generated here 
from the SLN to enable fast exact substructure match searching of an Oracle 
table. (Rare ties in CRC values for non-identical SLNs are broken by appending 
<name=junk> to the duplicate-generating SLN and attempting to reregister 
until a unique CRC is generated.) 

SLN (text) SLN of a fragment, with open valence(s) at point(s) of attachment. 

LOGP (NUMBER(6,2)) logP of the fragment, calculated for the structure 
where all open valences are filled with H's. A value of 99.99 denotes 
"could not be calculated". 

MW (NUMBER(10,2)) molecular weight of the fragment exactly as described 
by the SLN. A value of -1.0 denotes "could not be calculated". 

TOPOMERIC (text) a textual representation of the CoMFA steric field for 
the topomeric conformation of this molecule. (The 3D SLN of this conformation 
is written to a file in the fileset with extension .fal, for possible future 
reference.) 

NROTBONDS (NUMBER(2,0)) number of bonds whose torsional values were set 
for this side chain during generation of the topomeric conformation. 

PH_AS, PH_DL, PH_DS, PH_AL, PH_AR (NUMBER(2,0)) number of pharmacophoric 
points within this side chain, of different classes as defined by DISCO and 
SYBYL 6.3/Unity 2.6. 
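
The tie-breaking rule described for the CRC column can be sketched as follows. This is an illustrative Python sketch, not the SPL or Oracle code: the function name register_sln, the in-memory dictionary standing in for the RDATA table, and the use of zlib.crc32 as the hash are all assumptions (the text does not say which CRC polynomial is used). The suffix form <name=DUPn> follows the renaming step in the trsl_acd macro.

```python
import zlib


def register_sln(table, sln, crc_func=None):
    """Return the CRC key under which `sln` is stored, adding it if needed.

    `table` maps CRC -> stored SLN. If a different SLN already occupies
    the key, a <name=DUPn> suffix is appended and registration is retried
    until a free key is found or the same fragment is recognised.
    """
    if crc_func is None:
        # Assumption: any deterministic string hash serves for the sketch.
        crc_func = lambda s: zlib.crc32(s.encode("ascii"))
    candidate, attempt = sln, 0
    while True:
        key = crc_func(candidate)
        stored = table.get(key)
        if stored is None:
            table[key] = candidate
            return key
        # Compare with any <name=...> suffix trimmed from the stored SLN.
        if stored.split("<name=")[0] == sln:
            return key
        attempt += 1
        candidate = f"{sln}<name=DUP{attempt}>"


rdata = {}
key = register_sln(rdata, "CH3CH2X1")
print(key, rdata[key])
```

Passing a deliberately weak crc_func (for example len) is a convenient way to exercise the collision path, since genuine 32-bit collisions are rare.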

# following are definitions of Oracle queries used for referencing table RDATA 
# within SPL. 

RDBMS REFERENCE DEFINE oracle_rdata tripos oracle castor \ 
MACHINE_ACCESS_INFO explicit_userid lawless explicit_password jlu816yl \ 
RDBMS_ACCESS_INFO EXPLICIT_USERID adsvl explicit_password adsvl \ 
DONE 

RDBMS QUERY DEFINE RDATA_DATA oracle_rdata 
select SLN,LOGP,MW,TOPOMERIC,NROTBONDS from RDATA where 
CRC = :NEW_CRC 
#. 
DONE 

ALL THAT REMAINS IS GETACD.CORE AND CERTAIN FILES FROM 
CHOM_BATCH.CORE 



# There are only two important user entry points 
# "optiv" for most purposes 
# "cores" for building the .cores file (to be replaced) 

@macro optiverse sybylbasic 
# 
# sets global state variables, then dispatches tasks in order 
# 
# $1 is a set of reaction IDs 
# $2 is a set of modifiers (variations to be skipped, NoSearch, Test, ) 
# TEST = only the first item in each hitlist is processed 
# (allows checking out all input data quickly) 
# DEBUG = uims ver on at all times 
# RONLY = only process specified rows in REAGENTS 
# NOSEARCH = skip search (hit lists must already exist in 
# working directory) 
# NOCAT = skip concatenation of Xn files 
# SEARCH = ONLY do search 
# BUILD = ONLY convert hitlists to Xn files 
# CSLN = ONLY build CSLN template 
# CORES = ONLY do core search and processing 
# numeric values - two interpretations 
# if RONLY, these are the ROW IDS in the REAGENT MSS to use 
# if not RONLY, these are VARIATIONS to NOT process 
globalvar ACD!cmd ACD!db ACD!inited CHOM!Err ACD!Pool ACD!Xs \ 
ACD!DoSearch ACD!Test ACD!Price ACD!Password \ 
ACD!Preferred_supplier ACD!qprop \ 
ACD!cost ACD!supplier ACD!FCD ACD!Only_Rs ACD!NoCAT ACD!Sites 
localvar nrg rxrow rcrows v vars rxn dosearch dobuild docsln docores 

setvar args2 %uppercase( "$2" ) 
setvar ACD!DoSearch %not( %set_and( NOSEARCH "$args2" ) ) 
setvar ACD!Test %set_and( TEST "$args2" ) 
setvar ACD!Price %not( %set_and( TEST "$args2" ) ) 
setvar ACD!NoCAT %set_and( NOCAT "$args2" ) 

# initialize other data if not done in a previous optiverse run 
if %not( $ACD!inited ) 
ACDinit 
if %not( $ACD!inited ) 
return 
endif 
endif 

setvar ACD!only_rs 
if %set_and( RONLY "$args2" ) 
setvar ACD!only_rs $args2 
endif 

setvar dosearch TRUE 
setvar dobuild TRUE 
# next is obsolete . . 
setvar docsln 
setvar docores TRUE 

setvar procs %set_and( SEARCH,BUILD,CORES,CSLN "$args2" ) 
if $procs 
# if subprocess(es) specified set all false, only those specified on 
setvar dosearch %set_and( SEARCH "$args2" ) 
setvar dobuild %set_and( BUILD "$args2" ) 
setvar docsln %set_and( CSLN "$args2" ) 
setvar docores %set_and( CORES "$args2" ) 
endif 

for rxn in %set_unpack( $1 ) 
setvar vars %tblsrch_val( REACTIONS CLASS_ID $rxn ) 
if %not( $vars ) 
%dialog_message( ERROR \ 
"REACTIONS has no entry for a Class ID of: $rxn" "Bad REACTIONS Data" ) 
return 
endif 

if %set_and( DEBUG "$args2" ) 
uims ver on 
else 
uims ver off 
endif 

%file_delete( startup.pho ) >$nulldev 
photo on startup.pho >$nulldev 

setvar nv 1 
for v in $vars 
# allow variations to be skipped 
if %or( "%not( $args2 )" "%not( %set_and( "$v" "$args2" ) )" ) 
echo Variation $nv (ID: $v) of %count( $vars ) 
setvar nv %math( $nv + 1 ) 

TABLE DEFAULT REACTIONS 
setvar nrg %rcell( $v NREAG ) 
setvar rcts %rcell( $v VARIATION ) 
setvar rcrows %tblsrch_val( REAGENTS VARIATION $rcts ) 
if %not( %eq( $nrg %count( $rcrows ) ) ) 
%dialog_message( ERROR \ 
"For Variation $v of Reaction $rxn,\ 
REAGENTS has %count( $rcrows ) rows\ 
but REACTIONS specifies $nrg reagents" \ 
"Bad REACTIONS or REAGENTS Data" ) 
return 
endif 

if $ACD!only_rs 
setvar svrows %set_unpack( "%set_and( %set_create( $rcrows ) \ 
$ACD!only_rs )" ) 
if %not( $svrows ) 
echo No reactant classes to be searched or built for Reaction $rxn 
endif 
else 
setvar svrows $rcrows 
endif 

if $dosearch 
get_acd $rxn $rcts %set_create( $svrows ) 
endif 

if $dobuild 
trsl_acd $rxn $rcts %set_create( $svrows ) 
endif 

%file_delete( finish.pho ) >$nulldev 
photo on finish.pho >$nulldev 

# CSLN file generation is obsolete 
if $docsln 
csln_files $rxn $rcts %set_create( $rcrows ) 
endif 

if $docores 
cores $rxn $rcts %set_create( $rcrows ) 
endif 
endif 

ACD!RxnUpdate $rxn $rcts 
endfor 
endfor 

uims ver off 
photo off 

#. 
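
The way the optiverse macro turns its second argument into task switches can be summarised in a short sketch. It is in Python for illustration only; the function name parse_modifiers and the dictionary keys are hypothetical, and the modifier set is assumed to be a comma-separated string. The defaults (search, build and cores on, CSLN off) and the rule that naming any of SEARCH, BUILD, CSLN or CORES switches the others off follow the macro as listed.

```python
def parse_modifiers(modifiers):
    """Derive optiverse-style switches from a comma-separated modifier set."""
    mods = {m.strip().upper() for m in modifiers.split(",") if m.strip()}
    flags = {
        "do_search": "NOSEARCH" not in mods,  # corresponds to ACD!DoSearch
        "test": "TEST" in mods,               # corresponds to ACD!Test
        "price": "TEST" not in mods,          # corresponds to ACD!Price
        "no_cat": "NOCAT" in mods,            # corresponds to ACD!NoCAT
        "debug": "DEBUG" in mods,
        "row_only": "RONLY" in mods,
    }
    tasks = {"SEARCH", "BUILD", "CSLN", "CORES"}
    requested = mods & tasks
    if requested:
        # Only the explicitly named subprocesses run.
        flags["run"] = {task: task in requested for task in tasks}
    else:
        # Default: everything except the obsolete CSLN step.
        flags["run"] = {"SEARCH": True, "BUILD": True,
                        "CSLN": False, "CORES": True}
    # Numeric modifiers are row IDs (with RONLY) or variations to skip.
    flags["numeric"] = sorted(int(m) for m in mods if m.isdigit())
    return flags


print(parse_modifiers("test,build"))
```

Keeping the numeric modifiers separate makes their two interpretations explicit: the caller decides from row_only whether they select REAGENTS rows or exclude variations.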

@macro get_acd sybylbasic 
# 
# do reagent searches in ACD for all specified rows in reagents 
localvar fct rg sfrag buff bf hfname 

TABLE DEFAULT REAGENTS 
setvar rcrows %set_unpack( $3 ) 
for rg in $rcrows 
setvar sfrag %rcell( $rg SEARCH_SLN ) 
setvar hfname %cat( R $1 V $2 R %rcell( $rg ID ) ) 
setvar ofname %rcell( $rg SAME_AS ) 
if %streql( "$ofname" "?" ) 
setvar ofname 
endif 

if $ofname 
setvar ofname %substr( $ofname 1 %math( %pos( "." $ofname ) - 1 ) ) 
endif 

if %or( "$ACD!DoSearch" "%not( %file_exists( %cat( $hfname .hits ) ) )" ) 
if %and( "$ofname" "%file_exists( %cat( $ofname .hits ) )" ) 
# dcl /bin/cp %cat( $ofname .hits ) %cat( $hfname .hits ) 
else 
# prepare notlist file 
setvar notf %open( %cat( $hfname .bad ) "w" ) 
for not in %rcell( $rg NOTLIST ) 
if %file_exists( $not ) 
# write out any fragments NOT CONTAINED in SEARCH FRAGMENT 
setvar bf %open( $not "r" ) 
while %not( %eof( $bf ) ) 
setvar buff %read( $bf ) 
if %and( "%not( %eof( $bf ) )" \ 
"%not( %streql( "%substr( "$buff" 1 1 )" "#" ) )" ) 
setvar notin %not( %search2d( $sfrag \ 
"$buff" NoTriv 1 y ) ) 
if %or( "$notin" "%and( "%not( $notin )" \ 
"%gt( %sln_atom_count( "$buff" ) 1 )" )" ) 
%write( $notf $buff ) >$nulldev 
echo Not excluding $not fragment $buff (contained in $sfrag ) 
endif 
endif 
endwhile 
%close( $bf ) 
else 
%write( $notf $not ) >$nulldev 
endif 
endfor 

%close( $notf ) 
# prepare query file 
setvar notf %open( %cat( $hfname .query ) w ) 
%write( $notf $sfrag ) >$nulldev 
%close( $notf ) 
# do search (first time for individual components, second time to filter 
# multicomponent cpds retrieved) 
echo .. Searching for %rcell( $rg SEARCH_SLN ) 
setvar dbs dcl $ACD!cmd -database $ACD!db -qfile \ 
%cat( $hfname .query ) -notlist \ 
%cat( $hfname .bad ) -hitlist tmp.hits -coords 
if $ACD!Test 
setvar dbs $dbs -maxhits 10 
endif 
$dbs 

setvar dbs dcl $ACD!cmd -database tmp.hits \ 
-dbtype sln -qfile %cat( $hfname .query ) \ 
-notlist %cat( $hfname .bad ) -hitlist %cat( $hfname .hits ) 
$dbs 
endif 
endif 
endfor 



@macro trsl_acd sybylbasic 
# 
# prepare Xn files, ensure properties are recorded for all side chains 

globalvar ACD!CycFrag 
localvar rcrows ma xls hfname how patin template xfile fname 
localvar f f1 rg valences h nout XRgs allcrc crc 

setvar rcrows %set_unpack( $3 ) 
setvar ma M1 
setvar ACD!Xs 
setvar ACD!Pool 
setvar xls 
setvar xs 
TAILOR SET MAXIMIN2 MAXIMUM_ITERATIONS 1000 
setvar split_atms 
setvar XRgs 
setvar supp 

ACD!INIT_Std_Topomer 

# reset CRC uniqueness checker 
%CRC_NOT_UNIQUE( junk junk ) >$nulldev 

# for all reagents in this variant of this reaction 
for rg in $rcrows 
TABLE DEFAULT REAGENTS 
setvar nout 0 
setvar hfname %cat( R $1 V $2 R %rcell( $rg ID ) ) 
%file_delete( %cat( $hfname .pho ) ) >$nulldev 
photo on %cat( $hfname .pho ) >$nulldev 

setvar xfile %rcell( $rg ATTACHED ) 
setvar ACD!Xs %set_or( "$ACD!Xs" $xfile ) 
setvar XRgs[ $xfile ] $XRgs[ $xfile ] $rg 

setvar fname %cat( $hfname "." $xfile ) 
setvar ofname %rcell( $rg SAME_AS ) 
if %streql( "$ofname" "?" ) 
setvar ofname 
endif 

setvar do_copy 
if $ofname 
if %not( %streql( "$ofname" "H.X1" ) ) 
setvar p %substr( $ofname %math( %pos( R \ 
%substr( $ofname 2 ) ) + 2 ) \ 
%math( %strlen( $ofname ) - %pos( "." $ofname ) ) ) 
setvar rid %tblsrch_val( REAGENTS ID $p ) 
endif 
setvar do_copy %file_exists( $ofname ) 
endif 

# may only need to copy a previous version of *.Xn, if it's there 
if $do_copy 
else 
setvar fgpt_xtra %rcell( $rg FGPT_XTRA ) 
setvar uname %rcell( $rg USER_NAME ) 
setvar falign %open( %cat( $hfname ".fal" ) "a" ) 
setvar foracle %open( %cat( $hfname ".ora" ) "a" ) 

setvar how %rcell( $rg HOW ) 
if %not( $how ) 
echo No HOW specified for row $rg in REACTANT table 
goto nxtreactant 
endif 

setvar ACD!FixGeom %pos( CLIP "$how" ) 
setvar ACD!CycFrag 
setvar patin %rcell( $rg PRUNE_SLN ) 
setvar valences %rcell( $rg VALENCES ) 
for ats in %range( 1 $valences ) 
setvar xls $xls %cat( X $ats ) 
endfor 

setvar keep_ats 
setvar xats 
if %gt( %count( $patin ) 1 ) 
setvar xats %arg( 2 $patin ) 
setvar keep_ats %arg( 3 $patin ) 
setvar patin %arg( 1 $patin ) 
endif 

setvar template %rcell( $rg TEMPLATE ) 
if $template 
setvar split_atms %arg( 4 $template ) 
setvar CHOM!Align[ FIX_CF_CALLBACK ] %arg( 3 $template ) 
setvar CHOM!Align[ SLN ] %arg( 2 $template ) 
setvar template %arg( 1 $template ) 
mol in m6 $template >$nulldev 
CHOM!INIT_BUILD_3D M5 
endif 

setvar f %open( $fname "w" ) 
if $fgpt_xtra 
%write( $f # %cat( FGPT_X= $fgpt_xtra ) ) >$nulldev 
endif 
if $uname 
%write( $f # %cat( USER_NAME= $uname ) ) >$nulldev 
endif 

setvar f1 
# setvar f1 %open( %cat( $hfname ".base." $xfile ) "w" ) 
echo .. Translating hits for %rcell( $rg SEARCH_SLN ) 
if %set_and( HITS %set_create( %table_name() ) ) 
echo Error - HITS table already exists! 
return 
endif 

# read in the hitlist (it better be there!) and add price, FCD# columns 
TABLE CREATE hits unity $ma FROM_A_FILE \ 
%cat( $hfname .hits ) | >$nulldev 
if %not( %set_and( HITS %set_create( %table_name() ) ) ) 
echo No HITS exist for %rcell( $rg SEARCH_SLN ) ! 
else 
if $ACD!Price 
table column_append rdbms tcd_price first price 
table column_append rdbms tcd_suppliers first supplier 
table eval new * PRICE,SUPPLIER 
endif 

setvar args %table( * ROW NUM ) 
setvar wrote1 
# processing all the hits 
for h in $args 
echo $h 
table default HITS 
setvar allsln %sln_get_sln_from_table( HITS $h ) 
# skip isotopically labelled reagents 
if %pos( "[I=" "$allsln" ) 
echo Skipping isotopically labelled $allsln 
goto nxt_rxnb 
endif 

setvar pat %search2d( $allsln $patin NoTriv 1 y ) 
# break up compound SLN into molecular components 
setvar p %pos( "." $allsln ) 
while $p 
setvar allsln %substr( "$allsln" 1 %math( $p - 1 ) ) \ 
%substr( "$allsln" %math( $p + 1 ) ) 
setvar p %pos( "." "$allsln" ) 
endwhile 

# cycle through any components until we get the RELEVANT molecular component 
for cpsln in $allsln 
setvar cpsln %fix_acd( $cpsln ) 
setvar pat %search2d( $cpsln $patin NoTriv 1 y ) 
if $pat 
setvar crc %sln_to_crc( $cpsln ) 
# check for within-hitlist duplicate of previous reagent providing same side chain 
if %CRC_NOT_UNIQUE( $crc ) 
echo Duplicate reagent SLN skipped \ 
(supplier $supp) $cpsln 
goto nxt_rxn 
endif 

DEFAULT $ma >$nulldev 
%sln_to_mol( $ma $cpsln ) >$nulldev 
if $ACD!FixGeom 
if %not( %chom_concord( $ma ) ) 
goto nxt_rxn 
endif 
endif 

setvar ats %acd_do_rxn( $ma $patin $how ) 
if %not( $ats ) 
goto nxt_rxn 
endif 

setvar nowsln %sln_labelx( $ma $xls ) 
setvar px %pos( X $nowsln ) 
# convert R's into X's (should probably be in C) 
while $px 
# check for isolated X's in ACD input 
if %not( %set_and( "%substr( $nowsln \ 
%math( $px + 1 ) 1 )" 1,2,3,4,5,6,7,8,9 ) ) 
echo Input contains Isolated X -- reactant discarded 
goto nxt_rxn 
endif 
setvar nowsln %cat( %substr( $nowsln 1 %math( $px - 1 ) ) R \ 
%substr( $nowsln %math( $px + 1 ) ) ) 
setvar px %pos( X $nowsln ) 
endwhile 

# must ensure that every Rx is unique 
setvar rct 1 
setvar px %pos( %cat( R $rct ) $nowsln ) 
while $px 
setvar xs %set_or( "$xs" %cat( X $rct ) ) 
setvar py %pos( %cat( R $rct ) \ 
%substr( $nowsln %math( $px + 1 ) ) ) 
while $py 
setvar nowsln %cat( %substr( $nowsln 1 \ 
%math( $py + $px ) ) %math( $rct + 1 ) \ 
%substr( $nowsln %math( $px + $py + 2 ) ) ) 
setvar py %pos( %cat( R $rct ) \ 
%substr( $nowsln %math( $px + $py ) ) ) 
endwhile 
setvar rct %math( $rct + 1 ) 
setvar px %pos( %cat( R $rct ) $nowsln ) 
endwhile 

# check again for within-hitlist duplicate of previous reagent providing same side chain 
setvar crc %sln_to_crc( $nowsln ) 
if %CRC_NOT_UNIQUE( $crc ) 
echo Duplicate side chain SLN skipped: $nowsln 
goto nxt_rxn 
endif 



if $ACD!Price 
setvar nowsln %cat( $nowsln "<FCD=" %table( $h ROW NAME ) \ 
";PRICE=" %rcell( $h PRICE ) ";SUPPLIER=" ) 
# identify any preferred supplier present 
setvar supp %uppercase( %rcell( $h SUPPLIER ) ) 
setvar supp %ACD_Get_Preferred_Supplier( $supp ) 
if %not( $supp ) 
setvar nowsln %cat( $nowsln ' ) 
else 
setvar nowsln %cat( $nowsln $supp ) 
endif 
else 
setvar nowsln %cat( $nowsln "<FCD=" %table( $h ROW NAME ) ) 
endif 

# we have our SLN, now need to go off to RDATA to retrieve (find or generate) properties 
copy $ma M2 
default M2 >$nulldev 
# generate fragment(s) for identity search 
# NOTE - removal of reagent atoms may have split reagent up into independent fragments 
remove atom %set_create( %atoms( $xs ) ) >$nulldev 
setvar fsln %sln( M2 UNIQUE ) 
setvar p %pos( "<" $fsln ) 
if $p 
setvar fsln %substr( $fsln 1 %math( $p - 1 ) ) 
endif 

setvar p %pos( "." $fsln ) 
while $p 
setvar fsln %substr( "$fsln" 1 %math( $p - 1 ) ) \ 
%substr( "$fsln" %math( $p + 1 ) ) 
setvar p %pos( "." "$fsln" ) 
endwhile 

# because there may be multiple fragments per reactant, must sum over 
# these to get property values 
setvar tlogp 0 
setvar tmw 0 
setvar trb 0 
setvar tcmf 
# cycle through 1 or more fragments .. 
# for each, search Oracle table via CRC for a previous occurrence 
for sln in $fsln 
# check for multiple binding of THIS fragment 
setvar ACD!cycfrag 
if %gt( $valences 1 ) 
%sln_to_mol( M4 $sln ) >$nulldev 
default M4 >$nulldev 
setvar nat %mol_info( M4 NATOMS ) 
FILLVALENCE * H 1 1.09 1 1.09 1 1.09 >$nulldev 
setvar ACD!cycfrag %gt( %math( \ 
%mol_info( M4 NATOMS ) - $nat ) 1 ) 
endif 

# if a fragment closes a ring, must use the input conformation 
if $ACD!cycfrag 
# identify the atoms to be extracted 
setvar cycpat %search2d( %sln( $ma ) $sln NoDup 1 y ) 
setvar extract %set_create( %sln_rgroup_sybid( \ 
$ma $cycpat %range( 1 %sln_atom_count( $sln ) ) ) ) 
EXTRACT %cat( $ma "(" $extract ")" ) M4 >$nulldev 
if %not( $ACD!FixGeom ) 
echo WARNING: Side Chains are joined \ 
in a reactant $allsln but CLIP is not in HOW 
endif 
else 
%sln_to_mol( M4 $sln ) >$nulldev 
endif 

setvar sln %sln( M4 UNIQUE ) 
setvar ct 0 

sln_modified: 
setvar crc %sln_to_crc( $sln ) 
# find RDATA record -- have properties already if present 
if %not( %streql( %RDBMS_SetBindValue( \ 
$ACD!qprop NEW_CRC $crc ) TRUE ) ) 
echo RDBMS Set Bind Value failed -- quitting 
return 
endif 

setvar have1 
setvar matches %RDBMS_BindQuery( $ACD!qprop ) 
setvar EOQ 
while %not( $EOQ ) 
setvar rdata %RDBMS_ReadQuery( $ACD!qprop \ 
"%s %f %f %s %r" ) 
if %RDBMS_Error() 
setvar EOQ true 
else 
# trim previously stored SLN of any <name= before checking for string match 
setvar sln_noname %arg( 1 $rdata ) 
setvar p %pos( "<" $sln_noname ) 
if $p 
setvar sln_noname %substr( \ 
$sln_noname 1 %math( $p - 1 ) ) 
endif 

if %streql( $sln $sln_noname ) 
setvar have1 TRUE 
break 
else 
echo Different structures have same CRC's - renaming 
setvar p %pos( "<" $sln ) 
if $p 
setvar sln \ 
%substr( $sln 1 %math( $p - 1 ) ) 
endif 
setvar ct %math( $ct + 1 ) 
setvar sln %cat( $sln "<name=DUP" $ct ">" ) 
goto sln_modified 
endif 
endif 
endwhile 
# if fragment not in Oracle table, calculate, then store, fragment properties 
if %not( $have1 ) 
echo Adding $sln to RDATA 
if %streql( SH $sln ) 
goto nxt_rxn 
endif 

setvar vals %ACDcalcprop( $sln $ma \ 
$valences $falign $split_atms ) 
if %not( $vals ) 
echo Physical data not calculable for $sln 
goto nxt_rxn 
endif 

setvar rdata $sln $vals %set_size( "$CHOM!Align[ RBDS ]" ) 
if %not( %rdbms_transactionstart( oracle_rdata ) ) 
echo RDBMS_TRANSACTIONSTART failed - quitting 
return 
endif 

# building SQL command to do Oracle INSERT _ 

setvar cmd %cat( "{" $crc $sln $valences , \ 
%arg( 1 $vals ) ",*" %arg( 2 $vals ) \ 
%arg(3$vals)"'," %set_size( \ 
"$CHOM!Align[ RBDS ]" ) 
•■ NULL,NULL,NULL,NULL,NULL)" ) 

setvar cmd insert into RDATA VALUES $cmd ; 
if %not( %rdbms_transactionCommand( oracle_rdata "$cmd" ) ) 
echo Addition of side chain to Oracle RDATA table failed - Quitting 
return 
endif 

if %not( %rdbms_transactioncommit( oracle_rdata ) ) 
echo Transaction Commit failed -- quitting 
return 
endif 
endif 

# accumulate LogP, MW, rotatable bonds - if any is NULL, overall value is NULL 
setvar tlogp %ACD_Add( $tlogp %arg( 2 $rdata ) 99.99 ) 
setvar tmw %ACD_Add( $tmw %arg( 3 $rdata ) -1.0 ) 
setvar trb %ACD_Add( $trb %arg( 5 $rdata ) -1.0 ) 
if %and( "%not( %streql( "$tcmf" NULL ) )" \ 
"%not( %streql( "%arg( 4 $rdata )" NULL ) )" ) 
setvar tcmf %cat( $tcmf %arg( 4 $rdata ) ) 
else 
setvar tcmf NULL 
endif 
endfor 

# finished checking all fragments within a reagent from HITS 

# output side chain structure for CSLN construction on 1st pass only 
if $f1 
%write( $f1 %substr( $nowsln 1 %math( %pos( \ 
"<" $nowsln ) - 1 ) ) ) >$nulldev 
%close( $f1 ) >$nulldev 
setvar f1 
endif 

# keep building output string for .Xn file - Null values are represented by blanks 
setvar ACD!SLN $nowsln 
ACD!addval MW -1.0 $tmw 
ACD!addval RBD -1 $trb 
ACD!addval LOGP 99.99 $tlogp 
ACD!addval CTOPS NULL $tcmf STRCMP 

setvar nowsln %cat( $ACD!sln ">" ) 
%write( $f $nowsln ) >$nulldev 

setvar nout %math( $nout + 1 ) 
setvar wrote1 TRUE 

# write out data for future Oracle table matching RDATA to its uses in CSLN libraries 
if %not( $supp ) 
setvar supp NULL 
endif 

TABLE DEFA HITS 
setvar price %rcell( $h PRICE ) 
if %not( $price ) 
setvar price NULL 
endif 

%write( $foracle $crc $1 $2 $rg %table( $h ROW NAME ) \ 
$price $supp ) >$nulldev 
# only record first occurrence of a component containing the fragment 
break 
endif 

nxt_rxn: 
endfor 

if %and( "$wrote1" "$ACD!Test" ) 
break 
endif 

nxt_rxnb: 
endfor 
# finished all HITS !! 
# finished all HITS !! 



if $template 
ACD!INIT_Std_Topomer 
setvar template 
endif 

%close( $falign ) 
%close( $foracle ) 
%close( $f ) >$nulldev 
ACD!record REAGENTS $fname $rg VARIANTS UPDATED 
TABLE CLOSE hits NO >$nulldev 
echo $nout variations written to $fname 
endif 
nxtreactant: 
photo off 
endif 
endfor 

# %rdbms_close( oracle_RDATA ) >$nulldev 

#. 
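
The step in trsl_acd commented "must ensure that every Rx is unique" can be illustrated with a small sketch. It is Python, for illustration only, and reflects one reading of the net effect of the nested loops: any repeat of a label Rn after its first occurrence is bumped to Rn+1, and the process repeats for the next label. The function name make_labels_unique is hypothetical, and single-digit labels are assumed (a label such as R10 would be mishandled by this simple version).

```python
def make_labels_unique(sln):
    """Renumber repeated attachment labels so each Rn occurs once.

    Assumes single-digit labels; operates on the text only.
    """
    n = 1
    while f"R{n}" in sln:
        first = sln.index(f"R{n}")
        # Keep everything up to and including the first Rn ...
        head, tail = sln[:first + 2], sln[first + 2:]
        # ... and bump every later Rn to R(n+1).
        sln = head + tail.replace(f"R{n}", f"R{n + 1}")
        n += 1
    return sln


print(make_labels_unique("R1CH2CH2R1"))
```

A string whose labels are already distinct passes through unchanged, which matches the macro's behaviour of only rewriting when a duplicate is found.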



@macro record ACD 

# count how many variations are referenced by the new CSLN 
if %not( $ACD!Test ) 
TABLE DEFAULT $1 
dcl "wc $2 >junk.txt" 
setvar f %open( junk.txt r ) 
setvar buff %read( $f ) 
echo %wcell( $3 $4 %arg( 1 $buff ) ) >$nulldev 
echo %wcell( $3 $5 "%time()" ) >$nulldev 
%close( $f ) >$nulldev 
setvar f %table_attribute( FILENAME ) 
echo SAVING $f 
%file_delete( $f ) >$nulldev 
TABLE SAVE $f 
endif 



@macro RxnUpdate ACD 
# count and save how many products 
if %not( $ACD!Test ) 
setvar nprod 1 
setvar xrg %tblsrch_val( REACTIONS CLASS_ID $1 ) 
if $xrg 
if %eq( 1 %rcell( $xrg MORE_CORES ) ) 
table default cores 
setvar nprod %rcell( %tblsrch_val( \ 
CORES CLASS_ID $1 ) VARIANTS ) 
if %not( $nprod ) 
echo No VARIANTS value for CORES file for CLASS_ID $1 
return 
endif 
endif 
endif 

TABLE DEFAULT REAGENTS 
setvar ACD!Xs 
setvar XRgs 
for rg in %tblsrch_val( REAGENTS VARIATION $2 ) 
setvar x %rcell( $rg ATTACHED ) 
setvar ACD!Xs %set_or( "$ACD!Xs" $x ) 
setvar XRgs[ $x ] $XRgs[ $x ] $rg 
endfor 

for x in %set_unpack( $ACD!Xs ) 
setvar nvar 0 
for var in $XRgs[ $x ] 
setvar nxvar %rcell( $var VARIANTS ) 
if %or( "%not( $nxvar )" "%lt( "$nxvar" 1 )" ) 
setvar nxvar %rcell( $var SAME_AS ) 
if %streql( "$nxvar" "?" ) 
setvar nxvar 
endif 
if %not( $nxvar ) 
echo No variants value found or \ 
derivable for ID $var in REACTANTS 
return 
endif 
if %streql( $nxvar "H.X1" ) 
setvar nxvar 1 
else 
setvar rg %pos( R %substr( $nxvar 2 ) ) 
setvar rg %substr( $nxvar %math( $rg + 2 ) \ 
%math( %pos( "." $nxvar ) - $rg - 2 ) ) 
setvar nxvar %rcell( %tblsrch_val( REAGENTS ID $rg ) \ 
VARIANTS ) 
if %not( $nxvar ) 
echo No variants value found or derivable \ 
for ID $nxvar in REACTANTS 
return 
endif 
endif 
endif 
setvar nvar %math( $nvar + $nxvar ) 
endfor 
setvar nprod %math( $nprod * $nvar ) 
endfor 

TABLE DEFAULT REACTIONS 
echo Generated $nprod products 
echo %wcell( $xrg SIZE $nprod ) >$nulldev 
echo %wcell( $xrg UPDATED "%time()" ) >$nulldev 
setvar f %table_attribute( FILENAME ) 
echo SAVING $f 
%file_delete( $f ) >$nulldev 
TABLE SAVE $f 
endif 

#. 
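
The product count computed by RxnUpdate reduces to simple arithmetic: for each attachment file (X1, X2, ...) the VARIANTS counts of the contributing REAGENTS rows are summed, and the sums are multiplied together, starting from the number of core variants when the reaction has more than one core. A minimal Python sketch follows; the function name count_products and the dictionary layout are hypothetical.

```python
from math import prod


def count_products(variants_per_site, core_variants=1):
    """Number of combinatorial products for one reaction variation.

    variants_per_site maps an attachment file extension (e.g. "X1") to
    the list of VARIANTS counts of the reagent rows feeding that site.
    """
    site_totals = (sum(counts) for counts in variants_per_site.values())
    return core_variants * prod(site_totals)


# Two reagent classes feed X1 (120 and 30 side chains); one feeds X2 (40).
print(count_products({"X1": [120, 30], "X2": [40]}))
```

In LaTeX form, with $c$ core variants and $n_{x,r}$ variants for reagent row $r$ at site $x$, the count is $N = c \prod_{x} \sum_{r} n_{x,r}$.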

@macro delval ACD 

# removes all instances of an attribute/value pair from an SLN 
globalvar ACD!SLN 
localvar p p1 

setvar p %pos( $1 $ACD!SLN ) 
while $p 
setvar p1 %pos( ";" %substr( $ACD!SLN $p ) ) 
if %not( $p1 ) 
setvar p1 %pos( ">" %substr( $ACD!SLN $p ) ) 
endif 
setvar ACD!SLN %cat( %substr( "$ACD!SLN" 1 %math( $p - 1 ) ) \ 
%substr( "$ACD!SLN" %math( $p1 + $p ) ) ) 
setvar p %pos( $1 $ACD!SLN ) 
endwhile 



@macro addval ACD 

# appends attribute value pair to ACD!SLN in UNITY format, checking 
# for input values which simulate null values 
globalvar ACD!SLN 
localvar isnull 
# first remove all existing references/data 
ACD!delval $1 
setvar ACD!SLN %cat( $ACD!SLN ";" $1 "=" ) 
if %eq( $# 4 ) 
setvar isnull %streql( $2 $3 ) 
else 
setvar isnull %eq( $2 $3 ) 
endif 
if $isnull 
setvar ACD!SLN %cat( $ACD!SLN ) 
else 
setvar ACD!SLN %cat( $ACD!SLN $3 ) 
endif 



@expression_generator ACD_Add 
# 
# adds a new value and returns sum, or returns the supplied code for NIL 
# if either old or new value already codes for NIL 

# need to truncate values retrieved from Oracle DB 
setvar arg2 $2 
setvar p %pos( "." $arg2 ) 
if $p 
if %gt( %strlen( $arg2 ) %math( $p + 2 ) ) 
setvar arg2 %substr( $arg2 1 %math( $p + 2 ) ) 
endif 
endif 

if %streql( $arg2 $3 ) 
%return( $3 ) 
else 
%return( %math( $arg2 + $1 ) ) 
endif 
return 

#. 
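
The sentinel handling in ACD_Add can be restated as a short Python sketch (illustrative only; the function name acd_add is hypothetical). It follows the macro body as listed: the incoming value is truncated to two decimal places as text, compared with the "could not be calculated" code, and otherwise added to the running total. Note that, as listed, only the incoming value is tested against the code, not the running total.

```python
def acd_add(total, new_value, nil_code):
    """Add new_value (a string, as read from the database) to total.

    Returns nil_code unchanged when the truncated new value equals it.
    """
    text = str(new_value)
    dot = text.find(".")
    if dot != -1 and len(text) > dot + 3:
        # Keep at most two digits after the decimal point.
        text = text[:dot + 3]
    if text == str(nil_code):
        return nil_code
    return float(text) + total


print(acd_add(1.5, "2.25", "99.99"))
print(acd_add(1.5, "99.990000", "99.99"))
```

The truncation matters because a value stored as NUMBER(6,2) may come back with trailing zeros, which would otherwise defeat the string comparison with the code.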



@expression_generator ACD_Get_Preferred_Supplier 

# identify "best" supplier, edit name as needed 
localvar p prefs supp 
setvar prefs %set_and( "$1" $ACD!Preferred_Supplier ) 
if $prefs 
# if ANY suppliers are preferred, pick the best 
for p in %set_unpack( $ACD!Preferred_Supplier ) 
setvar supp %set_and( $p $prefs ) 
if $supp 
break 
endif 
endfor 
else 
# else just grab the first one 
setvar supp %arg( 1 %set_unpack( $1 ) ) 
if %streql( "-" "$supp" ) 
setvar supp 
endif 
endif 

# can't tolerate hyphens 
setvar p %pos( "-" "$supp" ) 
if $p 
setvar supp %cat( %substr( $supp 1 %math( $p - 1 ) ) \ 
%substr( $supp %math( $p + 1 ) ) ) 
endif 

%return( $supp ) 
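
The supplier selection rule above can be sketched in Python (for illustration; the function name preferred_supplier and the list-based arguments are hypothetical). The preference list is scanned in order and the first preferred supplier that actually offers the reagent wins; if none is preferred, the first listed supplier is taken, a bare "-" is treated as no supplier, and the first hyphen in the chosen name is removed.

```python
def preferred_supplier(suppliers, preferred):
    """Pick one supplier name for a reagent.

    suppliers: names offering the reagent, in catalogue order.
    preferred: names in decreasing order of preference.
    """
    offered = set(suppliers)
    choice = next((p for p in preferred if p in offered), None)
    if choice is None:
        # No preferred supplier offers it: fall back to the first listed.
        choice = suppliers[0] if suppliers else ""
        if choice == "-":
            choice = ""
    dash = choice.find("-")
    if dash != -1:
        # Hyphens cannot be tolerated in the output; drop the first one.
        choice = choice[:dash] + choice[dash + 1:]
    return choice


print(preferred_supplier(["ALDRICH", "SIGMA-X"], ["SIGMA-X", "ALDRICH"]))
```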



@expression_generator ACD_core_props 
# 
# generate physicochemical data 
table default RSCRATCH 
echo %wcell( 1 1 %sln( M1 ) ) >$nulldev 
TABLE EVAL ALL 1 MW 
# note that Xn each have an "AW" of 12.011 - back these out 
setvar mw %math( %rcell( 1 3 ) - %count( $* ) * 12.011 ) 
# replace Xn by Me groups for best LogP estimate 
setvar sln %sln( M1 ) 
setvar p %pos( "X" $sln ) 
while $p 
setvar sln %cat( %substr( $sln 1 %math( $p - 1 ) ) \ 
CH3 %substr( $sln %math( $p + 2 ) ) ) 
setvar p %pos( "X" $sln ) 
endwhile 

echo %wcell( 1 1 $sln ) >$nulldev 
table eval all 1 CLOGP >$nulldev 
setvar logp %rcell( 1 2 ) 
if %not( $logp ) 
echo LogP not calculated for $sln 
setvar logp 99.99 
endif 

%return( "$mw $logp" ) 

#. 



@expression_generator SybID2SLN 
# returns the ID of the atom in the SLN that corresponds to a SYBYL ID 
setvar arg %arg( 1 %set_unpack( $3 ) ) 
for i in %range( 1 %mol_info( $1 NATOMS ) ) 
if %eq( $arg %arg( 1 %set_unpack( %sln_rgroup_sybid( $1 $2 $i ) ) ) ) 
%return( $i ) 
return 
endif 
endfor 

#. 

@macro ACDinit sybylbasic 
# 
# read in MSS's, initiate database location and dbsearch engine 

globalvar ACD!cmd ACD!db CHOM!Align ACD!inited ACD!SLNin ACD!SLNout 

setvar ACD!db /common3/lawless/acd/acd_udb 
# other one is /ads/lawless/ACD 
setvar ACD!cmd /home5/jilek/bin/dbsearch.ads 
set CGQ_timeout 0 
setvar TA_RDBMS_READ_TIMEOUT 50000 
# odd bond types get created, later overridden by Concord 

table recall reactions 
table recall reagents 
table recall cores 

# Oracle setup 
take /home8/lawless/tcd >$nulldev 
take /tmp_mnt/netysn/home4/cran,er/panlabs/synplan/rdata >$nulldev 

if %not( %RDBMS_Open( oracle_rdata ) ) 
echo could not open Oracle table: RDATA with side chain data 
return 
endif 

setvar ACD!qprop %RDBMS_SetupQuery( oracle_rdata RDATA_DATA ) 
if %not( $ACD!qprop ) 
echo RDATA query could not be Setup 
return 
endif 

if $ACD!Price 
if %not( %rdbms_open( oracle_tcd ) ) 
echo ACD Price Oracle table not opened 
return 
endif 
endif 

ACD!INIT_TOPOMER 

setvar ACD!SLNin N[+1](=O)(O[-1]) N[+1](=O)O[-1] 
setvar ACD!SLNout N(=O)(=O) N(=O)=O 

setvar ACD!inited TRUE 



@macro INIT_TOPOMER ACD 
# initializes topomer calculations 
globalvar ACD!TopInited ACD!Sites 

if %not( $ACD!TopInited ) 
table recall %cat( $DSERV_TB RSCRATCH ) m3 >$nulldev 
table CONF SLN 
setvar CHOM!Align[ DEBUG ] 
setvar CHOM!Align[ BUMPS ] 
setvar CHOM!Align[ ALICYC ] All_trans 
setvar CHOM!Align[ CHARGE ] None 
setvar CHOM!Align[ MCORE ] M6 
setvar CHOM!Align[ ORIENT ] 
setvar CHOM!Align[ FITRMS ] 0.6 
setvar CHOM!Align[ ATTACHED ] 
setvar CHOM!Align[ CORE_SLN ] 
uims load %cat( $DSERV_TB chom_batch.core ) >$nulldev 

ACD!INIT_STD_TOPOMER 
set CGQ_timeout 0 
setvar ACD!Sites[FILE] $TA_DEMO/disco_file.dat 
setvar ACD!Sites[FILE] /view/sybBDFR4K/vob/src/sybyl/demo/disco_me.dat 

param modi >$nu«dev atom.def F F 4 TH F 9 1 .30 GREEN 0.0 N 

4 0 N N 3 12.63 18 16 F | 
parameter add bond_type C.3 O.2 1 NO O.3 C.2 1 NO N.ar H 1 \ 
NO S.o2 N.3 1 NO S.o2 N.2 1 NO S.o2 N.pl3 1 NO \ 
N.1 H 1 NO S.o2 S.3 1 NO | | >$nulldev 
parameter add bond_length C.3 O.2 1 1.5 O.3 C.2 1 1.5 \ 
N.ar H 1 1.0 S.o2 N.3 1 1.5 S.o2 N.2 1 1.5 S.o2 \ 
N.pl3 1 1.5 N.1 H 1 1.0 S.o2 S.3 1 1.6 | | >$nulldev 

endif 

setvar ACD!TopInited TRUE 
@macro INIT_STD_TOPOMER ACD 
# (re)sets standard topomer template info after a TEMPLATE was supplied by a 
# REAGENT 
setvar CHOM!Align[ SLN ] NH=CHCH2Any 
mol in m6 %cat( $DSERV_TB amidine.mol2 ) 
setvar CHOM!Align[ FIX_CF_CALLBACK ] ACD!AMIDTORS 
CHOM!INIT_BUILD_3D M5 



@expression_generator tblsrch_val
# performs a search by value within some column of an MSS,
# returns space separated row IDs
localvar rows
table defa $1
if %eq( $# 3 )
    setvar rows %table( %cat( "{RANGE(" $2 "," \
        %math( $3 - 0.0001 ) "," %math( $3 + 0.0001 ) ")}" ) ROW NUM )
else
    setvar rows %table( %cat( "{RANGE(" $2 "," $3 "," $4 ")}" ) ROW NUM )
endif
%return( "$rows" )
return
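
The value search performed by tblsrch_val can be illustrated outside SPL. The following C fragment is only a sketch under stated assumptions: the column is taken to be a plain array of doubles, rows are assumed to be numbered from 1, and the function names rows_in_range and rows_with_value are hypothetical. It mirrors the two forms of the call, a single value matched within +/- 0.0001 and an explicit lower and upper bound.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: collects the row numbers (assumed 1-based) whose
   value lies in [lo, hi]. Returns the number of matches stored. */
size_t rows_in_range(const double *column, size_t nrows,
                     double lo, double hi,
                     size_t *rows_out, size_t max_out)
{
    size_t i, found = 0;

    assert(column != NULL && rows_out != NULL);
    for (i = 0; i < nrows && found < max_out; i++) {
        if (column[i] >= lo && column[i] <= hi)
            rows_out[found++] = i + 1;
    }
    return found;
}

/* Three-argument form: match one value within a tolerance of 0.0001. */
size_t rows_with_value(const double *column, size_t nrows, double value,
                       size_t *rows_out, size_t max_out)
{
    return rows_in_range(column, nrows, value - 0.0001, value + 0.0001,
                         rows_out, max_out);
}
```

The tolerance guards against small rounding differences when a numeric key such as a class ID is stored as a floating point cell.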



25 



30 



@expression_generator ACD_DO_RXN
# $1 = molecule area; $2 = SLN pattern; $3 and following are transformations
# which convert the reactant in $1 to its product form,
# attachment point atoms being named by Xn
# returns TRUE if all went well

globalvar ACD!recno
localvar ma sln tsf recno ats atm fdl nx

setvar ma $1
DEFAULT $ma >$nulldev
setvar sln $2
shift
shift

setvar pat %search2d( %sln( $ma ) $sln NoDup 1 y )
if %not( $pat )
    echo ACD_DO_RXN: $sln not found in %sln( $ma )
    return
endif
# set up mapping of SLN IDs to invariant RECNO's
setvar ats %sln_rgroup_sybid( $ma $pat %range( 1 %sln_atom_count( $sln ) ) )
for atm in %range( 1 %sln_atom_count( $sln ) )
    setvar anow %arg( 1 %set_unpack( %arg( $atm $ats ) ) )
    setvar ACD!recno[ $atm ] %atom_info( $anow RECNO )
endfor

setvar nx 0
# execute reaction, step-by-step
for tsfm in $*
    setvar tsfm %set_unpack( $tsfm )
    switch %uppercase( %arg( 1 $tsfm ) )
    case ATYPE)
        modify atom type %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) \
            %arg( 3 $tsfm ) 1 1.5 1 1.5 1 1.5 1 1.5 >$nulldev
        ;;
    case BREAKB)
        setvar a1 %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] )
        setvar a2 %recno_to_id( $ma $ACD!recno[ %arg( 3 $tsfm ) ] )
        setvar bond %bonds( %cat( $a1 "=" $a2 ) )
        if $bond
            switch %bond_info( $bond TYPE )
            case 1)
            case am)
                remove bond $bond >$nulldev
                ;;
            case 2)
            case ar)
                modify bond type $bond 1 >$nulldev
                ;;
            endswitch
        else
            echo ACD_DO_RXN: $tsfm but no bond exists
            return
        endif
        ;;
    case SPLIT)
        setvar a1 %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] )
        setvar a2 %recno_to_id( $ma $ACD!recno[ %arg( 3 $tsfm ) ] )
        SPLIT $a1 $a2 >$nulldev

°'ele a.o™ %rec„c_.oJd( S.a SACD.recnd 2 S.sf™ , 1 ) >Snunaev 
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*ec„o .oJd( Sma SACDIrecno, %a.g( 2 Suta „ ) N 

3 Stsfn, ) 1 1.5 1 1.5 1 1.5 ) 
5 seu.ar ACDIrecnot ?Sarg( 4 Stsfm ) 1 %a.om.mfo( SNEW.ATOM. 

'''' TdTom %recno_toJd( $ma $ACD!recno[ %arg( 2 Stsfm ) ] ) \ 
%arg( 3 Stsfm ) 1 1.5 >$nulldev 

• tisTPW ATOM ID REGNO ) 
10 setvar ACD!recno[ %arg( 4 Stsfm ) ] %atom_mfo( $NEW_ATO _ 

    case MARKX)
        setvar nx %math( $nx + 1 )
        setvar aname %arg( 3 $tsfm )
        if %not( $aname )
            setvar aname %cat( X $nx )
        endif
        if %gt( %count( %atom_info( %recno_to_id( $ma \
                $ACD!recno[ %arg( 2 $tsfm ) ] ) ) ) 1 )
            echo WARNING: Multivalent attachment atom in %sln( $ma )
        endif
        modify atom name %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] ) \
            $aname >$nulldev
        ;;
    case MAKEB)
        setvar a1 %recno_to_id( $ma $ACD!recno[ %arg( 2 $tsfm ) ] )
        setvar a2 %recno_to_id( $ma $ACD!recno[ %arg( 3 $tsfm ) ] )
        setvar bond %bonds( %cat( $a1 "=" $a2 ) )
        if $bond
            switch %bond_info( $bond TYPE )
            case 1)
            case am)
                modify bond type $bond 2 >$nulldev
                ;;
            case 2)
                modify bond type $bond 3 >$nulldev
                ;;
            case )
                echo ACD_DO_RXN: $tsfm now has type: %bond_info( $bond TYPE )
                return
            endswitch
        else
            add bond $a1 $a2 1 1.5 >$nulldev
        endif
        ;;
    case CLIP)
        # prune some atoms in recognition SLN
        # use remaining atoms in recognition SLN to control mapping of
        # reactant side chains to product side chains
        setvar lp %pos( "(" "$tsfm" )
        setvar rp %pos( ")" "$tsfm" )
        if %or( "%not( $lp )" "%not( $rp )" )
            echo Missing parentheses in CLIP command
            return
        endif
        setvar ats
        for at in %substr( "$tsfm" %math( $lp + 1 ) %math( $rp - $lp - 1 ) )
            setvar ats $ats %sln_rgroup_sybid( $ma $pat $at )
        endfor
        setvar rs
        for at in %substr( "$tsfm" %math( $rp + 1 ) )
            setvar rs $rs %atom_info( %arg( 1 %set_unpack( \
                %sln_rgroup_sybid( $ma $pat $at ) ) ) RECNO )
        endfor
        # following routine: removes all $ats in ats EXCEPT for those directly
        # otherwise $rs is to contain RECNO's (invariant after deletions) for all
        # attached atoms NOT removed.
        %chom_rmv_ats( %cat( $ma "(" %set_create( $ats ) ")" ) $rs )
        ;;
    case )
        echo ACD_DO_RXN: Unknown HOW operator: $tsfm
        return
    endswitch
endfor

%return( $nx )
return
#. 

@macro FIX_FUSE ACD
# specific callback for aligning topomer confs of tryptanthrin variants
# ensure that NH=CH-CH2-C bond is 180 degrees and CH-CH2-C:C
# is 0 before FIT is done regardless of what Concord did to it.
localvar a
setvar a %set_unpack( $2 )
modify torsion %arg( 1 $a ) %arg( 3 $a ) \
    %arg( 5 $a ) %arg( 8 $a ) 180 >$nulldev
modify torsion %arg( 3 $a ) %arg( 5 $a ) \
    %arg( 8 $a ) %arg( 10 $a ) 0 >$nulldev

@macro AMID_TORS ACD
# default callback, ensures that NH=CHCH2Any torsion is set to 180
# (minimization will change it) so that MATCH can work
localvar a
setvar a %set_unpack( $2 )
modify torsion %arg( 1 $a ) %arg( 3 $a ) \
    %arg( 5 $a ) %arg( 8 $a ) 180 >$nulldev



@expression_generator ACDcalcprop
# calculates physical properties of a previously unknown side chain
# logP, MW, topomer field (via call to CHOM!This_Build_3D for conformer)
# uses RSCRATCH as workspace MSS
globalvar ACD!CycFrag
localvar split_atms buildhow

TABLE DEFAULT Rscratch
TABLE CONFORMER SLN
# set up NULL values so we can tell if calculation failed
echo %wcell( 1 CLOGP 99.99 ) >$nulldev
echo %wcell( 1 MW -1.0 ) >$nulldev
TABLE EVAL ALL 1 TOPOMERIC >$nulldev

# molecular weight for frag as is
echo %wcell( 1 SLN $1 ) >$nulldev
TABLE EVAL ALL 1 MW >$nulldev

# LogP is for structure with H instead of open valence
# echo %sln_to_mol( M4 $1 ) >$nulldev
default M4 >$nulldev
setvar nat %mol_info( M4 NATOMS )
# fix bad S=O typing
if %search2d( $1 S=O NoDup 1 y )
    for pat in %search2d( $1 O=S=O NoDup 0 y )
        modify atom type %sln_rgroup_sybid( M4 $pat 2 ) \
            S.o2 1 1.5 >$nulldev
    endfor
    for pat in %search2d( $1 O=S[F]Any NoDup 0 y )
        modify atom type %sln_rgroup_sybid( M4 $pat 2 ) \
            S.o 1 1.5 >$nulldev
    endfor
endif
# following replaces (and greatly simplifies) code that is believed to be obsolete
FILLVALENCE * H 1 1.09 1 1.09 1 1.09 >$nulldev
if %not( %gt( %mol_info( M4 NATOMS ) $nat ) )
    echo ERROR: NO unfilled valences in new fragment $1
    return
endif
modify atom name $NEW_ATOM_ID X1 >$nulldev
echo %wcell( 1 SLN %sln( M4 ) ) >$nulldev
TABLE EVAL ALL 1 CLOGP >$nulldev
# should check result here and go to simpler evaluation if CLOGP fails

# Add aligning group for Topomeric, to be found in $CHOM!Align[ MINIT ]
JOIN %cat( "M4(" %atoms( X1 ) ")" ) \
    %cat( $CHOM!Align[ MINIT ] "(6)" ) 1 1.54 >$nulldev

setvar cfa
setvar CHOM!Align[ ALICYC ] All_trans
setvar buildhow CONCORD
if $ACD!CycFrag
    setvar buildhow NOBUILD
    setvar CHOM!Align[ ALICYC ] None
endif
if %CHOM_THIS_BUILD_3D( M4 $buildhow $1 A )
    # remove aligning group before saving & doing CoMFA
    setvar pat %search2d( %sln( M4 ) $CHOM!Align[ SLN ] NoTriv y )
    if $5
        # need to save recnos of standard split before doing custom split
        setvar split_atms %atom_info( \
            %sln_rgroup_sybid( M4 $pat 8 ) RECNO ) \
            %atom_info( %sln_rgroup_sybid( M4 $pat 5 ) RECNO )
        SPLIT %sln_rgroup_sybid( M4 $pat %set_unpack( $4 ) ) >$nulldev
        SPLIT %recno_to_id( M4 %arg( 1 $split_atms ) ) \
            %recno_to_id( M4 %arg( 2 $split_atms ) ) >$nulldev
    else
        SPLIT %sln_rgroup_sybid( M4 $pat 8 5 ) >$nulldev
    endif
    # evaluate and save CoMFA field
    setvar fsln %cat( %sln( M4 FULL ) )
    echo %wcell( 1 SLN $fsln ) >$nulldev
    %write( $3 $fsln ) >$nulldev
    TABLE CONF SLN
    TABLE ENTER CELL 1 TOPOMERIC NO NO >$nulldev
    TABLE EVAL ALL 1 TOPOMERIC >$nulldev
    setvar cfa %rcell( 1 TOPOMERIC )
else
    setvar cfa NULL
endif

# round up and return all the results
setvar logp %rcell( 1 CLOGP )
setvar mw %rcell( 1 MW )
if %not( %streql( "$cfa" NULL ) )
    if %eq( "$cfa" 1.00 )
        setvar cfa NULL
    else
        setvar cfa %comfa_hex( 1 TOPOMERIC )
    endif
endif
%return( "$logp $mw $cfa" )



@expression_generator FIX_ACD
# does string search/replace for groups - specifically nitro
globalvar ACD!SLNin ACD!SLNout
localvar ans p arg ct
setvar ans $*
setvar ct 1
for arg in $ACD!SLNin
    setvar p %pos( "$arg" "$ans" )
    while $p
        setvar ans %cat( %substr( "$ans" 1 %math( $p - 1 ) ) \
            %arg( $ct $ACD!SLNout ) %substr( "$ans" \
            %math( $p + %strlen( $arg ) ) ) )
        setvar p %pos( "$arg" "$ans" )
    endwhile
    setvar ct %math( $ct + 1 )
endfor
%return( "$ans" )
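
The paired search and replace performed by FIX_ACD may be easier to follow in C. The fragment below is a sketch only, not the listing's own code: replace_all and fix_groups are hypothetical names, and the substitution scans forward through the text instead of restarting from the beginning after each replacement, which gives the same result whenever a replacement does not itself contain its search pattern.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Replaces every occurrence of `from` in `text` with `to`.
   Returns a newly allocated string (caller frees), or NULL on failure. */
char *replace_all(const char *text, const char *from, const char *to)
{
    size_t from_len, to_len, count = 0;
    const char *p;
    char *result, *out;

    assert(text != NULL && from != NULL && to != NULL);
    from_len = strlen(from);
    to_len = strlen(to);
    assert(from_len > 0);

    /* First pass: count matches so the result can be sized exactly. */
    for (p = text; (p = strstr(p, from)) != NULL; p += from_len)
        count++;

    result = malloc(strlen(text) - count * from_len + count * to_len + 1);
    if (result == NULL)
        return NULL;

    /* Second pass: copy text, substituting each match. */
    out = result;
    while ((p = strstr(text, from)) != NULL) {
        size_t head = (size_t)(p - text);
        memcpy(out, text, head);
        out += head;
        memcpy(out, to, to_len);
        out += to_len;
        text = p + from_len;
    }
    strcpy(out, text);
    return result;
}

/* Applies n paired substitutions in order: in[i] is replaced by out[i]. */
char *fix_groups(const char *sln, const char *const *in,
                 const char *const *out, size_t n)
{
    size_t i;
    char *cur = malloc(strlen(sln) + 1);

    if (cur == NULL)
        return NULL;
    strcpy(cur, sln);
    for (i = 0; i < n; i++) {
        char *next = replace_all(cur, in[i], out[i]);
        free(cur);
        if (next == NULL)
            return NULL;
        cur = next;
    }
    return cur;
}
```

With the two nitro patterns as input and their neutral forms as output, a charge-separated nitro group in a reagent SLN would be rewritten before substructure searching.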
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#. 

@macro cores sybylbasic
# Converts a hit list of core reactants into a hit list with cores, properties
# The side chains will be identical to those in some prototype rxn
localvar f buff files cct fct core_sln fcore weird xweird

setvar Xs
setvar Xlist
setvar xls
setvar weird Na K Ca

setvar fcore %cat( R $1 V $2 )
TABLE DEFAULT REACTIONS
setvar vars %tblsrch_val( REACTIONS CLASS_ID $1 )
setvar how_core %eq( 1 "%rcell( $vars MORE_CORES )" )
setvar coreflag NO
if $how_core
    setvar coreflag YES
endif

setvar fout %open( %cat( $fcore ".cores" ) "w" )
if %not( $ACD!NoCat )
    setvar rx %rcell( $vars CLASS_ID )
    setvar uname %rcell( $vars NAME )
    if %not( %eq( 1 %count( $uname ) ) )
        echo Not a one-word reaction NAME in row $vars : $uname
        return
    endif
    echo Preparing %cat( $fcore ".files" )
    TABLE DEFAULT REAGENTS
    setvar rcrows %set_unpack( $3 )
    for rg in $rcrows
        setvar x %rcell( $rg ATTACHED )
        setvar Xs %set_or( $x "$Xs" )
        setvar Xlist[ $x ] $Xlist[ $x ] $rg
    endfor
    setvar f %open( %cat( $fcore ".files" ) "w" )
    # following generates all combinations of all calls (no recursion in SPL)
    setvar npos %set_size( $Xs )
    setvar n2make 1
    for nx in %sort( %set_unpack( $Xs ) )
        setvar smax[ $nx ] %count( $Xlist[ $nx ] )
        setvar n2make %math( $n2make * $smax[ $nx ] )
        setvar idx[ $nx ] 1
    endfor
    for i in %range( 0 %math( $n2make - 1 ) )
        setvar idx %cat( R $rx "." ) $coreflag \
            $uname %cat( $fcore .cores )
        setvar base $i
        # establish indexes at each position
        for j in %set_unpack( $Xs )
            setvar rg %arg( %math( ( $base % $smax[ $j ] ) + 1 ) \
                $Xlist[ $j ] )
            setvar rf %rcell( $rg SAME_AS )
            if %and( "$rf" "%not( %streql( "$rf" ) )" )
                setvar idx $idx $rf
            else
                setvar idx $idx %cat( $fcore R %rcell( $rg ID ) \
                    %rcell( $rg ATTACHED ) )
            endif
            setvar base %math( $base / $smax[ $j ] )
        endfor
        %write( $f $idx ) >$nulldev
    endfor
    %close( $f )
endif

# now recover additional cores, if needed
if $how_core
    setvar cvars %tblsrch_val( CORES CLASS_ID \
        %rcell( $vars CLASS_ID ) )
    setvar how_core %rcell( $cvars HOW_CORE )
    setvar valences %rcell( $cvars VALENCES )
    setvar xls
    for ats in %range( 1 $valences )
        setvar xls $xls %cat( X $ats )
    endfor
    setvar core_sln %rcell( $cvars MORE_CORE )
    if %not( $core_sln )
        echo No MORE_CORE for reaction $vars
        return
    endif
    setvar xrlist %rcell( $cvars XRLIST )
    if %not( %eq( %count( $xrlist ) $VALENCES ) )
        echo mismatch between VALENCES and XRLIST for reaction $vars
        return
    endif
    setvar opat %string_insert( %string_insert( %rcell( $cvars XRCORE ) \
        %arg( 1 $xls ) %arg( 1 $weird ) ) \
        %arg( 2 $xls ) %arg( 2 $weird ) )
    setvar xrcore %rcell( $cvars XRCORE )
    core_get_acd $1 $2 $3
    setvar fhits %cat( $fcore core.hits )
    # start processing hits
    if %not( %file_exists( $fhits ) )
        echo $fhits (hitlist of core reactants) not found
        return
    endif
    setvar cct 1
    TABLE CREATE hits unity M5 FROM_A_FILE "$fhits" | >$nulldev
    if $ACD!Price
        table column_append rdbms tcd_price first price
        table column_append rdbms tcd_suppliers first supplier
        table eval new * PRICE,SUPPLIER
    endif
    %CRC_NOT_UNIQUE( junk junk ) >$nulldev
    setvar choices %table( * ROW NUM )
else
    setvar choices %arg( 1 %rcell( $vars CORE_SLN ) )
endif

for h in $choices
    if $how_core
        table default HITS
        setvar allsln %sln_get_sln_from_table( HITS $h )
    else
        setvar allsln $h
    endif
    # cycle through RELEVANT molecular component
    setvar p %pos( "." "$allsln" )
    while $p
        setvar allsln %substr( "$allsln" 1 %math( $p - 1 ) ) \
            %substr( "$allsln" %math( $p + 1 ) )
        setvar p %pos( "." "$allsln" )
    endwhile
    for cpsln in $allsln
        setvar cpsln %fix_acd( $cpsln )
        if $how_core
            setvar pat %search2d( $cpsln $core_sln NoDup 1 y )
            if %not( $pat )
                break
            endif
            setvar crc %sln_to_crc( $cpsln )
            if %CRC_NOT_UNIQUE( $crc )
                echo Skipping duplicate $cpsln
                break
            endif
            if %pos( "[I=" "$cpsln" )
                echo Isotope skipping $cpsln
                break
            endif
            echo Core $cct -- $cpsln
            %sln_to_mol( M1 $cpsln ) >$nulldev
            if %not( %acd_do_rxn( m1 $core_sln $how_core ) )
                goto nxt_core
            endif
            setvar outsln %sln_labelx( m1 $xls )



            # build XRLIST
            setvar osln %string_insert( %string_insert( \
                $outsln %arg( 1 $xls ) \
                %arg( 1 $weird ) ) %arg( 2 $xls ) %arg( 2 $weird ) )
            setvar patx %search2d( $osln $opat NoDup 1 y )
            if %not( $patx )
                echo $opat not found in $osln - skipping core
                goto nxt_core
            endif
            setvar xrl
            for x in $xrlist
                setvar x %set_unpack( $x )
                if $xrl
                    setvar xrl %cat( $xrl )
                endif
                setvar xrl %cat( $xrl %SLN_ID( $patx %arg( 2 $x ) ) \
                    %arg( 1 $x ) "=" %SLN_ID( $patx %arg( 3 $x ) ) )
            endfor

            # is core symmetric?
            setvar sym 0
            %sln_to_mol( M2 $osln ) >$nulldev
            %sln_to_mol( M3 %string_insert( %string_insert( \
                $outsln %arg( 1 $xls ) \
                %arg( 2 $weird ) ) %arg( 2 $xls ) \
                %arg( 1 $weird ) ) ) >$nulldev
            if %streql( %sln( M2 UNIQUE ) %sln( M3 UNIQUE ) )
                setvar sym 1
            endif
        else
            setvar outsln $cpsln
            setvar sym 0
            setvar xrl %arg( 2 %rcell( $vars CORE_SLN ) )
        endif

        ###
        # At this point $outsln is the SLN with X1, X2, etc for the
        # variation sites.
        # calculate number of rotatable bonds WITHOUT X1, X2 attachment
        # points.
        setvar newsln1 $outsln
        setvar ct 1
        setvar offset 2
        setvar p1 %pos( %cat( X $ct ) $newsln1 )
        while $p1
            setvar newsln1 %cat( %substr( $newsln1 1 %math( $p1 - 1 ) ) \
                %substr( $newsln1 %math( $p1 + $offset ) ) )
            setvar ct %math( $ct + 1 )
            if %eq( $ct 10 )
                setvar offset 3
            endif
            setvar p1 %pos( %cat( X $ct ) $newsln1 )
        endwhile
        setvar scratch_molarea %molempty()
        %sln_to_mol( $scratch_molarea $newsln1 ) >$nulldev
        setvar old_default $default_area
        default $scratch_molarea >$nulldev
        setvar bds %set_create( %bonds( (*-{RINGS()})&<1> ) )
        setvar mval %set_create( \
            %atoms( <H>+<o.2>+<F>+<I>+<Cl>+<Br>+<n.1>+<LP>+<Du> ) )
        setvar pds %set_create( %bonds( %cat( "{TO_ATOMS(" $mval ")}" ) ) )
        setvar bds %set_diff( $bds $pds )
        if $bds
            setvar bds %set_size( $bds )
        else
            setvar bds 0
        endif
        zap $scratch_molarea
        default $old_default >$nulldev
        #
        # $outsln can also be sent to acd_core_propsl to generate MW and CLOGP
        #
        setvar props %ACD_Core_Propsl( $outsln )
        #
        # Change all X into Y_0
        #
        setvar ct 1
        setvar ypfx Y_0
        while TRUE
            if %pos( %cat( X $ct ) $outsln )
                setvar outsln %string_insert( \
                    $outsln %cat( X $ct ) %cat( $ypfx $ct ) )
            else
                break
            endif
            setvar ct %math( $ct + 1 )
            if %eq( $ct 10 )
                setvar ypfx Y_
            endif
        endwhile



        if $how_core
            TABLE DEFAULT HITS
            setvar sln %cat( $outsln "<FCD=" %table( $h ROW NAME ) \
                ";PRICE=" %rcell( $h PRICE ) ";SUPPLIER=" %uppercase( \
                %ACD_Get_Preferred_Supplier( %rcell( $h SUPPLIER ) ) ) \
                ";MW=" %arg( 1 $props ) ";RBD=" $bds ";LOGP=" \
                %arg( 2 $props ) ";SYM=" $sym ";XRLIST=" $xrl ">" )
        else
            setvar sln %cat( $outsln "<MW=" %arg( 1 $props ) ";RBD=" $bds \
                ";LOGP=" %arg( 2 $props ) ";SYM=" $sym ";XRLIST=" \
                $xrl ">" )
        endif
        %write( $fout $sln ) >$nulldev
        if $ACD!Test
            goto alldone
        endif
        nxt_core:
        setvar cct %math( $cct + 1 )
        break
    endfor
endfor

alldone:
%close( $fout ) >$nulldev
if $how_core
    TABLE CLOSE hits NO >$nulldev
    ACD!Record CORES %cat( $fcore " " ) $cvars VARIANTS UPDATED
endif



@expression_generator string_insert
setvar p %pos( $2 $1 )
if $p
    if $3
        setvar ans %cat( "%substr( $1 1 %math( $p - 1 ) )" \
            $3 "%substr( $1 %math( $p + %strlen( $2 ) ) )" )
        %return( $ans )
    else
        setvar ans %cat( "%substr( $1 1 %math( $p - 1 ) )" \
            "%substr( $1 %math( $p + %strlen( $2 ) ) )" )
        %return( $ans )
    endif
else
    %return( $1 )
endif
#. 

@expression_generator ACD_extract_ridX
# backs out the row id and X from the input file name
# get rid of first few chars
setvar arg %substr( $1 4 )
setvar r %pos( R $arg )
setvar p %pos( "." $arg )
%return( %set_create( %substr( $arg %math( $r + 1 ) %math( $p - $r - 1 ) ) \
    %substr( $arg %math( $p + 1 ) ) ) )

#. 
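
The parsing done by ACD_extract_ridX can be sketched in C as follows. This is an illustration under assumptions, not the listing's code: the function name extract_rid_x is hypothetical, and the file name is assumed to consist of three leading characters to be skipped, then an "R" followed by the row id, a ".", and the X value.

```c
#include <assert.h>
#include <string.h>

/* Parses names of the assumed form <3 chars>...R<row id>.<X>.
   Copies the row id and X into the caller's buffers.
   Returns 1 on success, 0 if a marker is missing or a buffer is too small. */
int extract_rid_x(const char *name, char *rid, size_t rid_size,
                  char *x, size_t x_size)
{
    const char *r, *dot;
    size_t rid_len;

    assert(name != NULL && rid != NULL && x != NULL);
    if (strlen(name) < 3)
        return 0;
    name += 3;                       /* get rid of the first few chars */
    r = strchr(name, 'R');           /* first "R" after the prefix */
    dot = strchr(name, '.');         /* first "." after the prefix */
    if (r == NULL || dot == NULL || dot < r)
        return 0;

    rid_len = (size_t)(dot - r - 1); /* characters between "R" and "." */
    if (rid_len + 1 > rid_size || strlen(dot + 1) + 1 > x_size)
        return 0;
    memcpy(rid, r + 1, rid_len);
    rid[rid_len] = '\0';
    strcpy(x, dot + 1);
    return 1;
}
```

Skipping the leading characters keeps the "R" at the very start of the name from being mistaken for the row id marker.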

@macro core_get_acd sybylbasic
# do reagent searches in ACD for all specified rows in reagents
localvar fct rg sfrag buff bf hfname

setvar rg %tblsrch_val( CORES CLASS_ID $1 )
setvar sfrag %rcell( $rg MORE_CORE )
setvar hfname %cat( R $1 V $2 core )

if %or( "$ACD!DoSearch" "%not( %file_exists( %cat( $hfname .hits ) ) )" )
    # prepare notlist file
    setvar notf %open( %cat( $hfname .bad ) "w" )
    for not in %rcell( $rg CORE_NOTLIST )
        if %file_exists( $not )
            # write out all bad fragments NOT CONTAINED by SEARCH FRAGMENT
            setvar bf %open( $not "r" )
            while %not( %eof( $bf ) )
                setvar buff %read( $bf )
                if %and( "%not( %eof( $bf ) )" "%not( %streql( \
                        "%substr( "$buff" 1 1 )" "#" ) )" )
                    if %not( %search2d( $sfrag "$buff" NoTriv 0 y ) )
                        %write( $notf $buff ) >$nulldev
                    else
                        echo Not excluding $not \
                            fragment $buff (contained in $sfrag )
                    endif
                endif
            endwhile
            %close( $bf )
        else
            %write( $notf $not ) >$nulldev
        endif
    endfor
    %close( $notf )
    # prepare query file
    setvar notf %open( %cat( $hfname .query ) "w" )
    %write( $notf $sfrag ) >$nulldev
    %close( $notf )

    # do search (first time for individual components,
    # second time to filter multicomponent cpds retrieved)
    echo .. Searching for $sfrag
    setvar dbs del $ACD!cmd -database $ACD!db -qfile \
        %cat( $hfname .query ) -notlist %cat( $hfname .bad ) \
        -hitlist tmp.hits -coords
    if $ACD!Test
        setvar dbs $dbs -maxhits 10
    endif
    $dbs
    setvar dbs del $ACD!cmd -database tmp.hits -dbtype sln -qfile \
        %cat( $hfname .query ) -notlist %cat( $hfname .bad ) \
        -hitlist %cat( $hfname .hits )
    $dbs
endif

Appendix "F"

/*E+:SYB_MGEN_GPLS_COMFA_HEX */
/*
 * int SYB_MGEN_GPLS_COMFA_HEX( identifier, nargs, args, writer )
 *
 * Expression generator that returns hex version of a fingerprint
 *
 * interface:
 *    %comfa_hex( Row CoMFA_col )
 *       with Row being a row to dump
 *       CoMFA_col being a column selection for the topomer fingerprint
 *
 * handles steric field or if 3 args electrostatic
 * converts fpt to 4 bits
 */

int SYB_MGEN_GPLS_COMFA_HEX( identifier, nargs, args, writer )
char *identifier;
int nargs;
char *args[];
PFI writer;
{
    int row, type, present;
    int err, i;
    set_ptr ref;
    ROWCOL_SEL_PTR row_sel;
    char *dum, *cname, *parname, *table;
    FieldPtr ofield;
    ComfaMolPtr cmp;

    if (! LM_ACCESS_CHECK_CmpdSel("CmpdSel","CmpdSel") )
    {   UBS_OUTPUT_MESSAGE(stdout,"This requires a license to CmpdSel\n");
        return 0; }
    if (nargs < 2 || nargs > 3 )
    {
        UIMS2_WRITE_ERROR(
            "Error: %comfa_hex (Row PrintCol (field 2 ) )\n" );
        return 0;
    }
    /* get the column */
    if (!(table=TBL_APLINT_GET_DEFAULT_TABLE() ) ) goto badcol;
    if (!(UIMS2_VARTYPE_CALC_VALUE("COL_SEL",args[1], &row_sel)) ||
        !TBL_ACCESS_INDEX_TO_COLNAME( table, row_sel->id -1, &cname ) ||
        !TBL_ATTR_SAMPLE_COLUMN_A(table, cname, "FIELD", &dum, &present)
        || !present)
    {   UBS_OUTPUT_MESSAGE(stdout,"Not a valid CoMFA column\n");
        goto badcol; }
    /* get the reference row */
    if (!(UIMS2_VARTYPE_CALC_VALUE("ROW_SEL",args[0], &row_sel)) ||
        !TBL_ACCESS_X_GET_VALUE(table, row_sel->id -1, cname,
            "CELL_SUPPORT", (int *)&cmp, &err ) )
    {
        UIMS2_WRITE_ERROR(
            "Error: Invalid reference row selection for %fp_hex\n" );
        return 0;
    }
    if (!cmp || !(ofield = (nargs == 3) ? cmp->efld_p : cmp->sfld_p) ) {
        /* the data is not there */
        UBS_OUTPUT_MESSAGE(stdout,"Not a valid CoMFA cell\n" );
        goto badcol; }
    dum = UIMS2_MessageBuffer;
    for (i=0; i<ofield->n_points; i++, dum += 1 )
        sprintf(dum, "%.1X", lookup_my_comfa_code(ofield->field_value[i]) );
    (*writer)( UIMS2_MessageBuffer );
    return 1;

badcol:
    UIMS2_WRITE_ERROR(
        "Error: Invalid column selection for %comfa_hex\n" );
    return 0;
}



int lookup_my_comfa_code(value)
fpt value;
{
    static fpt cutoff[16] = { 9999., 0., 2., 4., 6., 8., 10., 12.,
                              14., 16., 18., 20., 22., 24., 26., 30.
    };
    int i;

    if (!DABS_DUT_OKDATA(value)) return 0;
    for (i=1; i<16; i++) if (value <= cutoff[i]) return i;
    UBS_OUTPUT_MESSAGE(stdout,"Invalid field value above 30.0 set to missing.\n");
    return 0;
}
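
The 4-bit encoding used by %comfa_hex can be tried in isolation with the following sketch. It is not the licensed routine: field_code and field_to_hex are hypothetical names, a missing value is assumed to be represented by NaN in place of the DABS_DUT_OKDATA check, and the cutoff table is copied from the listing as best it can be read.

```c
#include <assert.h>
#include <stddef.h>

/* Cutoffs as read from the listing; index 0 is reserved for "missing". */
static const double cutoff[16] = {
    9999.0, 0.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0,
    14.0, 16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 30.0
};

/* Maps a field value to a code 1..15, or 0 if missing or above 30.0. */
int field_code(double value)
{
    int i;

    if (value != value)              /* NaN stands in for a missing value */
        return 0;
    for (i = 1; i < 16; i++)
        if (value <= cutoff[i])
            return i;
    return 0;
}

/* Writes one uppercase hex digit per field value; out needs n + 1 bytes. */
void field_to_hex(const double *values, int n, char *out)
{
    static const char digits[] = "0123456789ABCDEF";
    int i;

    assert(values != NULL && out != NULL && n >= 0);
    for (i = 0; i < n; i++)
        out[i] = digits[field_code(values[i])];
    out[n] = '\0';
}
```

Each lattice point thus costs one character, so a topomer field with a few hundred points fits in a short text string.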



/*E+:SYB_MGEN_GPLS_FP_HEX */
/*
 * int SYB_MGEN_GPLS_FP_HEX( identifier, nargs, args, writer )
 *
 * Expression generator that returns hex version of a fingerprint
 *
 * interface:
 *    %fp_hex( Row Finger_col )
 *       with Row being a row to dump
 *       Finger_col being a column selection for the fingerprint
 */
int SYB_MGEN_GPLS_FP_HEX( identifier, nargs, args, writer )
char *identifier;
int nargs;
char *args[];
PFI writer;
{
    int row, type, present;
    int err, i;
    set_ptr ref;
    ROWCOL_SEL_PTR row_sel;
    char *dum, *cname, *parname, *table;

    if (! LM_ACCESS_CHECK_CmpdSel("CmpdSel","CmpdSel") )
    {   UBS_OUTPUT_MESSAGE(stdout,"This requires a license to CmpdSel\n" );
        return 0; }
    if (nargs != 2 )
    {
        UIMS2_WRITE_ERROR(
            "Error: %fp_hex (Row PrintCol )\n" );
        return 0;
    }

    /* get the column */
    if (!(table=TBL_APLINT_GET_DEFAULT_TABLE() ) ) goto badcol;
    if (!(UIMS2_VARTYPE_CALC_VALUE("COL_SEL",args[1], &row_sel)) ||
        !TBL_ACCESS_INDEX_TO_COLNAME( table, row_sel->id -1,
            &cname ))
        goto badcol;
    if (! TBL_UTL_COL_TO_FUNCTION(table, cname, &parname))
        goto badcol;
    if (!TBL_ATTR_FIND_COLUMN_A( table, parname,
            "TYPE", &dum, &type ))
        goto badcol;
    type = TBL_IO_TYPE_TO_KEY( type );
    if (type != PROC_V_PRINT &&
        !(TBL_ATTR_SAMPLE_COLUMN_A(table, cname, "FINGERPRINT",
            &dum, &present) && present ) )
        goto badcol;

    /* get the reference row */
    if (!(UIMS2_VARTYPE_CALC_VALUE("ROW_SEL",args[0], &row_sel)) ||
        !TBL_ACCESS_X_GET_VALUE(table, row_sel->id -1, cname,
            "CELL_SUPPORT", (int *)&ref, &err ) ||
        !ref )
    {
        UIMS2_WRITE_ERROR(
            "Error: Invalid reference row selection for %fp_hex\n" );
        return 0;
    }
    dum = UIMS2_MessageBuffer;
    err = (ref[0]+31) / 32;
    for (i=1; i<=err; i++, dum+=8)
        sprintf(dum, "%.8x", ref[i] );
    (*writer)( UIMS2_MessageBuffer );
    return 1;

badcol:
    UIMS2_WRITE_ERROR(
        "Error: Invalid column selection for %fp_hex\n" );
    return 0;
}
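
The hex dump at the end of %fp_hex relies on a simple layout: the first word of the set holds the number of bits and the bitset follows in 32-bit words. A minimal sketch of that conversion, assuming this layout and using the hypothetical name bitset_to_hex, is:

```c
#include <assert.h>
#include <stdio.h>

/* set[0] holds the number of bits; set[1..] hold the bits, 32 per word.
   Writes 8 lowercase hex digits per word into out.
   Returns 1 on success, 0 if out_size is too small. */
int bitset_to_hex(const unsigned int *set, char *out, size_t out_size)
{
    unsigned int nwords, i;

    assert(set != NULL && out != NULL);
    nwords = (set[0] + 31u) / 32u;          /* round up to whole words */
    if ((size_t)nwords * 8u + 1u > out_size)
        return 0;
    out[0] = '\0';
    for (i = 1; i <= nwords; i++, out += 8)
        sprintf(out, "%.8x", set[i] & 0xffffffffu);
    return 1;
}
```

Unlike the listing, the sketch checks the output buffer size before writing, since the size of the shared message buffer is not stated in the text.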



Appendix "G"

/*
 * power
 *
 * David E. Patterson
 *
 * substantially changed 6/96 for cores-based reorganization of operation
 * updated to include more reaction info (Dick Cramer - 10/24/96)
 * updated to use DB_CT_CCT_GET_PRD routines (10/29/96 DEP)
 *
 * This program performs the following functions:
 * (1) read in one line from a ".files" file, one line per cores/X1/X2 file
 * (2) read in one core to process (core / X1 / X2 file) == a cSLN
 * (3) for each cSLN, open a fp file to contain fingerprints
 *     (a) first is fingerprint size in bits
 *     (b) 2nd is number of records in segment (header + core + n1 + n2)
 *     (c) 3rd record notes size of record in bytes
 *     (d) 4th is number of cSLN segments included (== 1 here always)
 *     (e) 5th and following ints contain the ASCII .2DRULES filename
 *     (A) next (second) record represents an "augmented fingerprint"
 *         which is made by attaching invariant pieces of X1 and X2 to core
 *         -> cardinality plus bitset is the record for every fp <-
 *     (B) then N1 + N2 augmented fingerprints records for all of the
 *         structural variations
 * (4) compute MBITS and LBITS estimates of worst case missing bits
 * (5) write out a "master record" entry for the result
 *
 * power -file <name> -line <m> -core <n> -fraction <f> -screendef <file>
 *       -prefix <file> +debug
 *
 * Options:
 *   -file name       - name is file with names of cores/X1/X2 that
 *                      determines what gets built
 *   -line number     - which line in file to process
 *   -core number     - which core in corefile named in line to process
 *   -fraction f      - fraction of products to be evaluated, 0.0 - 1.0
 *                      or if more than 1.0, it is the NUMBER desired
 *                      and an appropriate fraction is computed to yield
 *                      approximately this number
 *   -screendef file  - name of a file containing the fingerprint
 *                      definition rules.
 *   -prefix file     - name from which output filenames will be formed
 *                      (i.e. -prefix Hi --> Hi.fp and Hi.mf)
 *   +debug           - writes irrelevant info to stderr
 *                      This flag forces the display of all options
 */

/* use 3db
 * dbcc power.c -o power */
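
The fingerprint file layout described in items (a) to (e) of the header comment can be sketched as below. The sketch rests on assumptions that the comment does not settle: every record, including the header, is taken to be one cardinality word plus the bitset words, and the names words_per_record, cardinality and fill_header are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed record width in ints: one cardinality word plus the bitset. */
int words_per_record(int fp_bits)
{
    return 1 + (fp_bits + 31) / 32;
}

/* Counts the bits set in a bitset; this is the "cardinality" word. */
int cardinality(const unsigned int *words, int nwords)
{
    int i, count = 0;

    for (i = 0; i < nwords; i++) {
        unsigned int w = words[i];
        while (w) {
            w &= w - 1;              /* clears the lowest set bit */
            count++;
        }
    }
    return count;
}

/* Fills a header record laid out as items (a) to (e) of the comment.
   Returns 1 on success, 0 if the rules file name does not fit. */
int fill_header(int *rec, int fp_bits, int n1, int n2, const char *rules)
{
    int nwords = words_per_record(fp_bits);

    assert(rec != NULL && rules != NULL && fp_bits > 0);
    if (nwords < 5 ||
        strlen(rules) + 1 > (size_t)(nwords - 4) * sizeof(int))
        return 0;
    memset(rec, 0, (size_t)nwords * sizeof(int));
    rec[0] = fp_bits;                     /* (a) fingerprint size in bits */
    rec[1] = 2 + n1 + n2;                 /* (b) header + core + n1 + n2 */
    rec[2] = nwords * (int)sizeof(int);   /* (c) record size in bytes */
    rec[3] = 1;                           /* (d) one cSLN segment */
    strcpy((char *)&rec[4], rules);       /* (e) ASCII rules file name */
    return 1;
}
```

Storing the cardinality next to each bitset lets a later comparison use the bit counts without recounting them for every pair of fingerprints.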
#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"



#define GoodExit 0
#define ErrorExit 1
#define Visual(s) { fprintf s; }

static int    (*ExploderFunction)();
static char   *ScreenFileName;
static char   DefaultScreenFileName[32]
                  = "standard.2DRULES";
static int    *ScreenStructure;
static int    **fingerPointer;
static int    *fingerPrint = 0;
static int    *fingerMask = 0;
static int    fingerBits;
static int    Mbits;
static int    Lbits;
static double Fraction = 1.0;
static int    TopNumber = 0;
static char   *FileOfFiles;
static char   *Corefile, *X1file, *X2file;
static char   *PrefixForFiles;
static char   *ReactionCode;
static char   *UserRxnName;
static char   DefaultPrefixForFiles[20]
                  = "csln_preprocess";
static FILE   *InputSourceFile;
static FILE   *FileOfFilesFile;
static FILE   *fpFile;
static int    nbits[256];
static char   *fullQuery;
static char   **FGPT_X;
static char   *Xrlist;
static int    WordsPerFingerprint = 0;
static int    BytesPerFingerPrint = 0;
static int    CurrentSlnId = 0;
static int    DebugLevel;
static int    UserAborted;
static int    NullCore;
static int    MoreRxnInfo;
static int    StartCore = 0;
static int    LineFile = 0;
static char   *CombNameTemplate;
static int    CombCounter;
static int    **Y_01;   /* fingerprints */
static int    **Y_02;   /*      "       */
static int    nY_01;    /* number of structures */
static int    nY_02;    /*      "               */
static int    nProcessed = 0;
static void   *fullCsln, *xcoreCsln, *temp1Csln, *temp2Csln;
static char   *CoreSln;
static char   *X1xfile, *X2xfile;
static struct ParseOptions Options[] = {
/*** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END. ***/
    {"file", ParseOptString, &FileOfFiles,
        "File listing all input files" },
    {"fraction", ParseOptDouble, &Fraction,
        "Proportion of products(0 to 1) or Number to test" },
    {"screendef", ParseOptOldFile, &ScreenFileName,
        "File which defines the UNITY screen" },
    {"line", ParseOptInt, &LineFile,
        "Sequential entry to use in Files file" },
    {"core", ParseOptInt, &StartCore,
        "Sequential core to use in Cores file" },
    {"prefix", ParseOptString, &PrefixForFiles,
        "Filename root for output files" },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }
int lowercase (s) char *s; { while (*s) { if (isupper(*s)) *s = tolower(*s);
    s++; } }

static void UserHitControlC()
/*+I
 * This function is the signal handler for user initiated program
 * termination.
 * Its only role is to set a flag indicating that the user wishes to abort
 * the program.
 *
 * Author        Date      Description
 * ===========   ========  ================
 * G. B. Smith   02-09-93  Original Version
 */
{
    UserAborted = 1;
}



static int ParseArguments( argc, argv )
/*+I
 * This function parses the command line arguments.
 *
 * Returns:
 *    1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 * Errors:
 * See Also:
 *
 * Author        Date      Description
 * ===========   ========  ================
 * G. B. Smith   02-09-93  Original Version
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    if ( (!StartCore) || (!LineFile)) return 0;
    if (!PrefixForFiles) PrefixForFiles = DefaultPrefixForFiles;
    return 1;
SyntaxError:
    return 0;
}

int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;

    /*** Establish handler for a user interrupt. ***/
    signal( SIGINT, UserHitControlC );
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC );
#endif
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    time( &startTime );
    Visual((stderr,"Begin reading csln : %s",ctime(&startTime)));
    /* Let's actually do something now */
    WarmUp();
    if(!(FileOfFilesFile = UTL_FILE_FOPEN(FileOfFiles,"r"))) return 0;
    GetFileSet(FileOfFilesFile); /* get cSLN info - core, X1, X2 */
    if (! FGPT_X[0] || ! FGPT_X[1] ) goto FailureExit;
    if (!*FGPT_X[0] || !*FGPT_X[1] ) goto FailureExit;
    if (!ReadTheCslnInfo()) goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Begin computations: %s",ctime(&finishTime)));
    time( &finishTime );
    if (!UserAborted && !DoPiecewiseFingerprints()) goto FailureExit;
    totalTime = finishTime - startTime;
    if( ! totalTime ) totalTime = 1;
    Visual((stderr, "Created %d Finger Processed reagents in ",
        nProcessed = nY_01+nY_02 ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
        totalTime/(60*60),
        (totalTime%(60*60))/60,
        (totalTime % 60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
        (totalTime/((double)(nProcessed?nProcessed:1)))));

    time( &finishTime );
    Visual((stderr,"\nNow evaluating missing bits distribution at %s\n",
        ctime(&finishTime)));
    if (!UserAborted && !CheckMissingBits()) goto FailureExit;
    CoolDown();
    time( &finishTime );
    Visual((stderr,"End bits checking: %s",ctime(&finishTime)));
    Visual((stderr,"End cSLN preparation : %s",ctime(&finishTime)));
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}

int GetFileSet(f)
FILE *f;
{
    char *three_files, *hold, *pch;
    int i;

    /* does not read the core itself */
    for (i=0;i<LineFile;i++)
        if( -1 == UTL_SCAN_GETS(FileOfFilesFile, "\\", &three_files)) return 0;
    /* see how many tokens there are - if > 5, new format with rxn data */
    for (i = 0, pch = three_files; *pch; pch++) if (*pch == ' ') i++;
    if ((MoreRxnInfo = i>4) ) {
        for (pch = three_files; *pch != ' '; pch++); *pch++ = '\0';
        if(!(ReactionCode = UTL_STR_SAVE( three_files ) )) return 0;
        for (hold = pch ; *pch != ' '; pch++); *pch++ = '\0';
        NullCore = (int) strstr( "YES", hold );
        for (hold = pch; *pch != ' '; pch++); *pch++ = '\0';
        if (!(UserRxnName = UTL_STR_SAVE( hold ) )) return 0;
    }
    else pch = three_files;
    for (Corefile = pch; *Corefile == ' '; Corefile++) ;
    for (X1file = Corefile ; *X1file != ' '; X1file++) ;
    *X1file++ = '\0';
    for ( ; *X1file == ' '; X1file++) ;
    for (X2file = X1file ; *X2file != ' '; X2file++) ;
    *X2file++ = '\0';
    for ( ; *X2file == ' '; X2file++) ;
    Corefile = UTL_STR_SAVE(Corefile);
    X1file = UTL_STR_SAVE( X1file);
    X2file = UTL_STR_SAVE( X2file);

    hold = 0;
    nY_01 = testread(X1file,hold,1);
    nY_02 = testread(X2file,hold,2);
    return 1;
}

/* free up the arrays in the loop */
int CoolDown()
{
    char *hold;
    int i;

    for (i=0;i<nY_01;i++) UTL_MEM_FREE(Y_01[i]);
    UTL_MEM_FREE(Y_01);
    for (i=0;i<nY_02;i++) UTL_MEM_FREE(Y_02[i]);
    UTL_MEM_FREE(Y_02);
    UTL_FILE_DELETE(X1xfile);
    UTL_FILE_DELETE(X2xfile);
    UTL_MEM_FREE(Corefile) ;
    UTL_MEM_FREE( X1file);
    UTL_MEM_FREE( X2file);
    return 1;
}

int WarmUp()
{
    int i;

    (i&16)/16 + (i&32)/32 + (i&64)/64 +
    (i&128)/128 ;
    if (!ScreenFileName) ScreenFileName = DefaultScreenFileName;
    if (!(fp = UTL_FILE_FOPEN(ScreenFileName,"r"))) return 0;
    ScreenStructure = (int *) DB_BIT2_PARSE_2DSCREEN(fp);
    UTL_FILE_FCLOSE(fp); fp = 0;
    if (!ScreenStructure) return 0;
    BytesPerFingerPrint = DB_BIT2_GET_SIZE( ScreenStructure );
    WordsPerFingerprint = (BytesPerFingerPrint + 3) / 4;
    fingerPrint = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint);
    fingerMask = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint);
    if (Fraction > 1.0) TopNumber = Fraction;
    Get_BY_SLN_Mask(); /* Set up for LBITS by ignoring the counts */
    FGPT_X = (char**) UTL_MEM_ALLOC( sizeof(char *) * 2 );
    return 1;
}
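
The expression fragment `(i&16)/16 + ... + (i&128)/128` in WarmUp is the tail of a sum that counts the set bits of a byte value, and the later routines index a table named `nbits` by byte. A minimal sketch of how such a 256-entry table could be filled and used follows. The names `bit_count_table`, `init_bit_count_table` and `count_bits` are hypothetical stand-ins and are not taken from the listing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the listing's nbits table:
   bit_count_table[b] holds the number of 1 bits in the byte value b. */
static unsigned char bit_count_table[256];

/* Fill the table once; each term isolates one bit and scales it to 0 or 1. */
static void init_bit_count_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        bit_count_table[i] = (unsigned char)(
            (i & 1)       + (i & 2) / 2   + (i & 4) / 4   + (i & 8) / 8 +
            (i & 16) / 16 + (i & 32) / 32 + (i & 64) / 64 + (i & 128) / 128);
}

/* Count the set bits in a buffer of len bytes with one lookup per byte. */
static int count_bits(const unsigned char *buf, size_t len)
{
    int total = 0;
    size_t k;
    assert(buf != NULL || len == 0);
    for (k = 0; k < len; k++)
        total += bit_count_table[buf[k]];
    return total;
}
```

A table lookup per byte keeps the inner fingerprint loops free of bit-shifting, at the cost of a one-time initialization call.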

int Get_BY_SLN_Mask()
{
    /* placeholder until a general one is written.
       This is correct for standard.2DRULES as of 6/96 */
    int i;
    unsigned char *foo;

    foo = (unsigned char *) fingerMask;
    for (i=0;i<116;i++) *foo++ = 0xFF;
    for (i=116;i<124;i++) *foo++ = 0;
    return 1;
}

char *GenerateMySln(core)
char *core;
{
    /* ??? CONVERT THE Y_0x to Xn in core ??? */
    char *foo, *oof, *goo;

    goo = UTL_STR_SAVE(core);
    foo = strstr(goo,"Y_01");
    foo[1] = foo[2] = ' '; foo[0] = 'X';
    oof = strstr(goo,"Y_02");
    oof[1] = oof[2] = ' '; oof[0] = 'X';
    for (oof=foo=goo; *oof; oof++)
        if (*oof != ' ') *foo++ = *oof;
    *foo = '\0';
    return goo;
}

/* This routine should open the fp output file
   generate the full cSLN
   generate the augmented core SLN
   write header and augmented core fp to fp file
   generate *.rgroup files later fp work. */

int ReadTheCslnInfo()
{
    int i;
    char *junk, *hold, *line, *one, *two, *thr, *fou, *fiv, *legion;
    char *my_concatenate(), *augment();
    char *my_how_youve_grown();
    FILE *tfil;

    if (!(InputSourceFile = fopen(Corefile,"r"))) return 0;
    for (i=0;i<StartCore;i++)
        if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line)) return 0;
    fclose(InputSourceFile) ;
    if (!GrabXrlist(line)) return 0;
    one = strstr(line,"<");
    /* zap the parameters at the end of the line */
    *one = '\0';
    CoreSln = GenerateMySln(line);
    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".fp"))) return 0;
    if (!(fpFile = fopen(hold,"w"))) return 0;
    UTL_MEM_FREE(hold) ;
    i = BytesPerFingerPrint * 8 ;
    UTL_FILE_FWRITE( &i ,sizeof(int), 1 ,fpFile);
    fingerPrint[0] = 2 + nY_01 + nY_02;
    fingerPrint[1] = sizeof(int)*(WordsPerFingerprint + 1);
    fingerPrint[2] = 1;
    junk = ScreenFileName;
    hold = (char *) &(fingerPrint[3]) ;
    for (i=0; i < (WordsPerFingerprint-3)*sizeof(int); i++, junk++)
    {   *hold++ = *junk;
        if ( !*junk ) break; }
    UTL_FILE_FWRITE(fingerPrint,sizeof(int),WordsPerFingerprint,fpFile);
    if (!(X1xfile = UTL_STR_CONCATENATE(PrefixForFiles,".FGPT.1"))) return 0;
    tfil = fopen(X1xfile,"w");
    fprintf(tfil,"%s\n",FGPT_X[0]);
    fclose(tfil);
    if (!(X2xfile = UTL_STR_CONCATENATE(PrefixForFiles,".FGPT.2"))) return 0;
    tfil = fopen(X2xfile,"w");
    fprintf(tfil,"%s\n",FGPT_X[1]);
    fclose(tfil);
    if (!sln_defines_csln( &xcoreCsln, X1xfile, X2xfile)) return 0;
    if (!sln_defines_csln( &temp1Csln, X1file , X2xfile)) return 0;
    if (!sln_defines_csln( &temp2Csln, X1xfile, X2file)) return 0;
    if (!sln_defines_csln( &fullCsln, X1file, X2file)) return 0;
    return 1;
}

int GrabXrlist(string)
char *string;
{
    /* find XRLIST= and grab what's in there ! */
    char *foo, *strip_down();

    if (!(string = strstr(string,"XRLIST = "))) return 0;
    Xrlist = strip_down(string);
    return 1;
}

int testread(old, new, which)
char *old, *new;
int which;
{
    FILE *file, *elif;
    int i;
    char *line;
    char *strip_down();

    /* get and hold FGPT_X info here
       Expect it to be at top of file preceded by a # */
    if (!(file = fopen(old,"r"))) return 0;
    if (new && !(elif = fopen(new,"w"))) return 0;
    which--;
    FGPT_X[which] = 0;
    while (!FGPT_X[which])
    {   if (-1 == UTL_SCAN_GETS( file, "\\" , &line)) return 0;
        if ( line = strstr(line,"FGPT_X = ") ) FGPT_X[which] = strip_down(line);
    }
    /* this won't really work if the attachment point is NOT the first atom
       listed */
    FGPT_X[which] = UTL_STR_CONCATENATE("R1", FGPT_X[which]);
    for (i=0; ;i++)
    {
        if (-1 == UTL_SCAN_GETS( file, "\\" , &line)) break;
        if (new)
        {   UTL_SCAN_TOKENIZE(line,'<','\\');
            fprintf(elif,"%s\n",line); }
    }
    fclose(file); if (new) fclose(elif);
    return i;
}

char *strip_down(string)
char *string;
{
    int i;
    char foo, *retme;

    string = strstr(string," = ") ;
    for ( ; *string == '=' || *string == ' '; string++) ;
    foo = *string;
    if (foo != '"')
    {   for (i=0; ; i++)
            if ( (string[i] == ';') || (string[i] == '>')) break; }
    else
    {   for (i=0; ; i++)
            if ( (string[i] == '"')) break; }
    foo = string[i];
    string[i] = '\0';
    retme = UTL_STR_SAVE(string);
    string[i] = foo;
    return retme;
}
/* Assume that the fp file is opened and written to earlier */
int DoPiecewiseFingerprints()
{
    char *hold, *line1, *line2;
    int i;

    if (!(Y_01 = (int **) UTL_MEM_ALLOC( nY_01 * sizeof(int *))))
        return 0;
    if (!(Y_02 = (int **) UTL_MEM_ALLOC( nY_02 * sizeof(int *))))
        return 0;
    MakeAllPrints( xcoreCsln, 1, 1, &fingerPrint);
    DB_CT_CCT_GET_PRD_CLEANUP( xcoreCsln );

    for (i=0;i<nY_01;i++)
    {   if (!(Y_01[i] = (int *) UTL_MEM_ALLOC(WordsPerFingerprint *
                                              sizeof(int))))
            return 0;
    }
    MakeAllPrints( temp1Csln, nY_01, 1, Y_01);
    DB_CT_CCT_GET_PRD_CLEANUP( temp1Csln );

    for (i=0;i<nY_02;i++)
    {   if (!(Y_02[i] = (int *) UTL_MEM_ALLOC(WordsPerFingerprint *
                                              sizeof(int))))
            return 0;
    }
    MakeAllPrints( temp2Csln, 1, nY_02, Y_02);
    DB_CT_CCT_GET_PRD_CLEANUP( temp2Csln );
    return 1;
}



int
WritefpFunc(struct CtConnectionTable *ct, int num, int **indexes)
{
    int nbits;
    int *fprint;

    fprint = *fingerPointer++;
    memset( fprint, 0, BytesPerFingerPrint );
    if( !DB_BIT2_EVALUATE( ct, ScreenStructure, fprint, &nbits ))
        return 0;
    UTL_FILE_FWRITE( &nbits, sizeof(int), 1, fpFile);
    UTL_FILE_FWRITE( fprint, sizeof(int), WordsPerFingerprint, fpFile);
    return 1;
}

int
GrabfpFunc(struct CtConnectionTable *ct, int num, int **indexes)
{
    int *fprint;

    fprint = *fingerPointer++;
    memset( fprint, 0, BytesPerFingerPrint );
    if( !DB_BIT2_EVALUATE( ct, ScreenStructure, fprint, &fingerBits ))
        return 0;
    return 1;
}

int MakeOnePrint( void *Csln, int i, int j, int *fp)
{
    static int **productIndexes = 0;

    if (!productIndexes)
    {   productIndexes = (int **)UTL_MEM_CALLOC(2,sizeof(int *));
        productIndexes[0] = (int *)UTL_MEM_CALLOC(1,sizeof(int));
        productIndexes[1] = (int *)UTL_MEM_CALLOC(1,sizeof(int));
    }
    productIndexes[0][0] = i + 1;
    productIndexes[1][0] = j + 1;
    fingerPointer = &fp;
    DB_CT_CCT_GET_PRD_PRODUCT(Csln, 1, productIndexes, GrabfpFunc);
    return 1;
}

int MakeAllPrints(void *CslnThing, int n1, int n2, int **pfp)
{
    int numProducts, **productIndexes, i, j, nProcessed;
    int numConnections = 2;

    numProducts = n1 * n2;
    nProcessed = 0;
    productIndexes = (int **)UTL_MEM_CALLOC(numConnections,sizeof(int *));
    for ( i = 0 ; i < numConnections ; i++ )
        productIndexes[i] = (int *)UTL_MEM_CALLOC(numProducts,sizeof(int));

    /* enumerate every (Y_01, Y_02) pair, first index varying slowest */
    for (i=0;i<n1;i++) for (j=0;j<n2;j++)
    {   productIndexes[0][nProcessed] = i + 1;
        productIndexes[1][nProcessed] = j + 1;
        nProcessed++;
    }
    fingerPointer = pfp;
    DB_CT_CCT_GET_PRD_PRODUCT(CslnThing, numProducts, productIndexes,
                              WritefpFunc);
    for ( i = 0 ; i < numConnections ; i++ ) UTL_MEM_FREE(productIndexes[i]);
    UTL_MEM_FREE(productIndexes) ;
    return 1;
}
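
MakeAllPrints builds 1-based index pairs for every combination of the two reagent lists, with the first index varying slowest. A small sketch of that enumeration, together with the matching zero-based linear position of a pair, is given here. The helper names `fill_product_indexes` and `product_position` are illustrative assumptions and do not appear in the listing.

```c
#include <assert.h>

/* Fill idx1[] and idx2[] (each of length n1*n2) with 1-based index pairs,
   first index varying slowest. Returns the number of pairs written. */
static int fill_product_indexes(int n1, int n2, int *idx1, int *idx2)
{
    int i, j, n = 0;
    assert(n1 >= 0 && n2 >= 0);
    for (i = 0; i < n1; i++)
        for (j = 0; j < n2; j++) {
            idx1[n] = i + 1;
            idx2[n] = j + 1;
            n++;
        }
    return n;
}

/* Zero-based linear position of the zero-based pair (i, j)
   when the second list has n2 entries. */
static int product_position(int i, int j, int n2)
{
    return i * n2 + j;
}
```

With this ordering, the pair at zero-based indexes (i, j) sits at position `i * n2 + j`, so a single running counter is enough to identify a product while both loops advance.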

/* Also find Mbits and Lbits
   and write them where they belong */

/*
   Should reorganize to find worst cases rather than pure random
*/

int CheckMissingBits()
{
    int argCount, err = 1, i, j;
    int counts[21];

    nProcessed = 0;
    for (i=0;i<21;i++) counts[i]=0;
    if (TopNumber) Fraction = (double) TopNumber / (double) (nY_01 * nY_02);
    for (i=0;i<nY_01;i++) for (j=0;j<nY_02;j++)
    {
        if (UTL_MATH_URAND() > Fraction) continue;
        nProcessed++;
        MakeOnePrint( fullCsln, i, j, fingerPrint);
        CompareFingerPrint(Y_01[i], Y_02[j], 20, counts);
    }
    WriteMissingBits(20,counts);
    WriteMasterRecord();
    return 1;
}




CompareFingerPrint( one, two, Nbins, bins)
int *one, *two, Nbins, *bins;
{
    unsigned char *h1, *h2, *h3, *fing;
    int i, product, card, lcard, lbits;

    h1 = (unsigned char *) one;
    h2 = (unsigned char *) two;
    h3 = (unsigned char *) fingerMask;
    fing = (unsigned char *) fingerPrint;
    lcard = card = lbits = 0;
    for (i=0;i<BytesPerFingerPrint;i++, h1++, h2++, h3++, fing++)
    {   card  += nbits[ *h1 | *h2 ];
        lbits += nbits[ *h3 & *fing ];
        lcard += nbits[ (*h1 | *h2) & *h3 ]; }
    if ((card = fingerBits - card) < 0) goto NoWay; /* should be impossible */
    if ((lcard = lbits - lcard) < 0) goto NoWay;    /* should be impossible */
    if (card > Mbits) Mbits = card;
    if (lcard > Lbits) Lbits = lcard;
    if (card > Nbins) card = Nbins;
    bins[card] += 1;
    return 1;
NoWay:
    return 0;
}
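
CompareFingerPrint counts how many bits of a product fingerprint are not accounted for by the union of its two reagent fingerprints, then tallies that count in a histogram whose last bin collects overflow. A minimal, self-contained sketch of the same arithmetic is shown here. The function names are hypothetical, a simple loop replaces the `nbits` lookup table, and the separate masked count (`Lbits`) is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Portable per-byte bit count (stands in for the listing's nbits table). */
static int byte_bits(unsigned char b)
{
    int n = 0;
    while (b) { n += b & 1; b >>= 1; }
    return n;
}

/* Returns (bits set in product) - (bits set in a|b), or -1 if that
   difference would be negative (the case the listing calls impossible). */
static int missing_bits(const unsigned char *product,
                        const unsigned char *a,
                        const unsigned char *b,
                        size_t len)
{
    int in_product = 0, in_union = 0;
    size_t k;
    assert(product && a && b);
    for (k = 0; k < len; k++) {
        in_product += byte_bits(product[k]);
        in_union   += byte_bits((unsigned char)(a[k] | b[k]));
    }
    return (in_product >= in_union) ? in_product - in_union : -1;
}

/* Record a count in a histogram with bins 0..nbins;
   counts above nbins are clamped into the last bin. */
static void tally(int count, int nbins, int *bins)
{
    if (count > nbins) count = nbins;
    bins[count] += 1;
}
```

The largest missing-bit count seen over the sampled products is what the listing keeps as `Mbits`.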

WriteMissingBits(n, counts)
int n, *counts;
{
    int i, sum;

    sum = 0;
    for (i=0;i<=n;i++) { printf("%d - %d; ",i,counts[i]); sum += counts[i]; }
    printf("\n");
    if (sum != nProcessed)
        fprintf(stderr,
                "Mismatch indicates possible error in core entry\nOnly %d of %d found\n",
                sum, nProcessed);
}

/* File format of the "master record" is
   Reaction class name
   Reaction specific name
   Number of varying sites == 2 so far
   Mbits
   Lbits
   *.core filename
   *.core index
   prefix.fp
   number of fp records before 1st == 0 in this program
   X1 filename
   X2 filename
*/

WriteMasterRecord()
{
    FILE *fp;
    char *hold;

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".mr"))) return 0;
    if (!(fp = UTL_FILE_FOPEN(hold,"w"))) return 0;
    UTL_MEM_FREE(hold);
    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".fp"))) return 0;
    if (MoreRxnInfo)
        fprintf(fp, "Reaction class %s%s\n%s\n%d\n%d\n%d\n%s\n%d\n%s\n%d\n%s\n%s\n",
                ReactionCode, NullCore ? " NO_core" : "", UserRxnName, 2, Mbits,
                Lbits, Corefile, StartCore, hold, 0, X1file, X2file);
    else fprintf(fp, "Reaction class Unknown\n%s\n%d\n%d\n%d\n%s\n%d\n%s\n%d\n%s\n%s\n",
                PrefixForFiles, 2, Mbits, Lbits, Corefile, StartCore, hold, 0, X1file,
                X2file);
    UTL_MEM_FREE(hold);
    UTL_FILE_FCLOSE(fp);
    return 1;
}

int sln_defines_csln(void **c, char *file1, char *file2)
{
    int numConnections = 0;
    char *connectionFiles[2];

    if (file1) { connectionFiles[ numConnections++ ] = file1; }
    if (file2) { connectionFiles[ numConnections++ ] = file2; }
    if (numConnections < 2) { fprintf(stderr,"\nNo X1 or X2 file - failure\n");
                              return 0; }
    *c = (void *) DB_CT_CCT_GET_PRD_INIT(CoreSln, Xrlist, numConnections,
                                         connectionFiles);
    if ( !*c )
    {   fprintf(stderr,"\nUnable to init"); return 0; }
    return 1;
}

Appendix "H"

/* Similarity - formerly dbcslnsim */
/* mod to read from the master file format (DEP 6/26/96) */
/* mod to read/write bitset files (DEP 9/19/96) */
/* mod to read $TA_MOLTABLES screendef file if not where fp file points */
/* mod to take the "-q" format of input SLN */
/* mod to use fp mask to improve searches (DEP 10/24/96) */

/*+C
 * This program evaluates (approximate) Tanimoto 2D similarity vs one cSLN
 * based on preprocessing of the substituent reagents.
 *
 * Input file is a master file with one multiline record per cSLN.
 * Record format is
 *    Reaction class xxxx   (where "Reaction class" is a literal)
 *    reaction_name
 *    number_of_sv_sites
 *    missing_bits_count    (may be overridden by mask)
 *    hashed_only_missing_bits_count
 *    core_filename
 *    core_filename_index_of_core
 *    fingerprint_filename
 *    offset_into_fingerprint_file
 *    first_sv_file_X1
 *    second_sv_file_X2     (etc if more than two sv_sites)
 *
 * Queries are input as SLN repeatedly from stdin; ending on ^D or X
 * The optional ASCII output file contains one line per hit, of the form
 *    Y1 Y2 T Tmax
 * where Y1   = index of the substituent in X1.pro file
 *       Y2   = index of the substituent in X2.pro file
 *       T    = apparent Tanimoto similarity
 *       Tmax = maximum possible Tanimoto, given the slop bits (see below)
 * The (required) checkpoint file is in the standard CSR format, which can
 * also be used instead of the master file to start a search.
 *
 * Similarity -master <name> -bitset <name> -Tanimoto <real> -range <exp>
 *            -index <int> -maxhits <int> -output <name> -checkpoint <name>
 *            +debug
 *
 * Options:
 *
 *   -master name      - name is the file with master file records
 *   -bitset name      - name is a result of an earlier search operation
 *                       (use EITHER master or bitset)
 *   -index number     - which sequential record in master file to use
 *                       OR offset into bitset in a bitset file
 *   -Tanimoto tan     - tan is a Tanimoto similarity 0.0 - 1.0
 *                       (default is 0.85)
 *   -maxhits max      - stop when max hits are found (default infinity)
 *   -input filename   - name of file with queries (default stdin)
 *   -q string         - single SLN query string
 *   -output filename  - specifies the output file for the hit info
 *                       (Mainly used for debugging - otherwise obsolete)
 *   -checkpoint name  - file to which bitset results will be written
 *   -mask hex         - hex format bitmask of missing bits (tan_hex form)
 *   -range range_exp  - set of internal cSLN ids (Y_01 varies slowest)
 *                       for which similarity will be computed. Range_exp
 *                       is a comma separated list of one or more of the
 *                       following primitives:
 *
 *                         *     - everything in the cSLN
 *                         1-18  - ids 1,2,3,...,18
 *                         5-*   - ids from 5 to the last in the cSLN.
 *                         17    - id 17 only
 *
 *   -append           - append results to an existing output file
 *                       By default an output file is overwritten.
 *
 *   +debug            - writes irrelevant info to stderr
 *                       This flag forces the display of all options
 */

/* use 3db
 * dbcc Similarity.c -o Similarity */
#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "commonData.h"
static char   *OutputFileName = 0;
static char   *MasterFile = 0;
static int     MasterRecord;
static FILE   *MasterFile_File;
static char   *FngrFile;
static int     FingerCore_Card;
static int    *FingerCore_FP;
static char   *InputSource = 0;
static char   *fullQuery;
static char   *BitsetFile;
static char   *CheckPointFileName;
static char   *directQuery = 0;
static double  Tanimoto = 0.85;
static int     AppendToOutputFile = 0;
static int     WordsPerFingerprint = 0;
static int     BytesPerFingerPrint = 0;
static int     CurrentSlnId = 0;
static int     NoMorehitsPlease = 999999999;
static char   *DatabaseRangeString = "*";
static int     DebugLevel;
static int     UserAborted;
static int     First, Last;
static int     Pro_size;
static char   *mASCII = 0;
static int    *MaskMissingBits = 0;
static int    *MaskQueryBits = 0;

static struct ParseOptions Options[] = {
/* !!!! DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END. */
    {"master", ParseOptString, &MasterFile,
        "Name is the file with master file records" },
    {"bitset", ParseOptString, &BitsetFile,
        "Name is the file with bitset records" },
    {"Tanimoto", ParseOptDouble, &Tanimoto,
        "Similarity threshold (0.0 to 1.0)" },
    {"index", ParseOptInt, &MasterRecord,
        "Which MasterRecord entry 1-n" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"input", ParseOptString, &InputSource,
        "File from which queries will be read (default stdin)." },
    {"q", ParseOptString, &directQuery,
        "Query string to use instead of a file or stdin" },
    {"output", ParseOptString, &OutputFileName,
        "File to which ASCII hit info will be written. OBSOLETE " },
    {"checkpoint", ParseOptString, &CheckPointFileName,
        "File to which bitset info will be written." },
    {"mask", ParseOptString, &mASCII,
        "Hex mask of missing bits" },
    {"range", ParseOptString, &DatabaseRangeString,
        "Range of cSLN ids to compare to query" },
    {"append", ParseOptNoArg, &AppendToOutputFile,
        "Use -append to append results to an existing file" },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }
int lowercase(s) char *s; { while (*s) { if (isupper(*s)) *s = tolower(*s); s++; } }

static void UserHitControlC()
/*+I
 *
 * This function is the signal handler for user-initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
{
    UserAborted = 1;
}

static int ParseArguments( argc, argv )
/*
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 *
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    return 1;
SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*
 * Returns: 1 on success, else 0
 *
 */
{
    char *msg;
    FILE *fp;

    if( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user, not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) The file to output to exists.
            **     Use the ownership of the current owner of the file, or if you can't do that
            **     do not do anything.
            ** (2) The file is being created.
            **     Use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            {   /* If the file exists and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            {   /* Create the file as the REAL user */
                seteuid(uid);
            }
        }
        OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb"));
        if( !OutputFile ) {
            fprintf(stderr,"Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}

static int ParseRangeExpr( expr, maximum, low, high )
/*
 *
 * Function evaluates a structure range expression. See the module
 * description in this file for a definition of structure range expressions.
 *
 * Returns: Function returns 1 if the expression is correct. If the
 *          expression is incorrect 0 is returned.
 *
 * Author        Date      Description
 * G. B. Smith   02-12-91  Original Version
 */
char *expr;      /* A structure range expression */
int maximum;     /* Maximum structure number. 999999999 */
int *low;        /* RETURN: low value in the range */
int *high;       /* RETURN: High value in the range */
{
    char *p;

    for( p=expr; *p && isdigit(*p); p++ );
    if( !*p ) {
        sscanf( expr, "%d", low );
        *high = *low;
    } else if( 2 == sscanf( expr, "%d-%d", low, high)){
    } else if( 1 == sscanf(expr,"%d-*",low )) {
        *high = maximum;
    } else if( !strcmp( expr, "*" )) {
        *low = 1; *high = maximum;
    } else {
        fprintf(stderr, "ERROR: Invalid structure range \"%s\"\n",
                expr );
        goto BadExpression;
    }
    if( *low < 1 ) {
        fprintf(stderr,
                "ERROR: Structure range must be greater than zero\n" );
        goto BadExpression;
    }
    if( *high > maximum ) {
        fprintf(stderr,
                "INFO: Specified range (%d-%d) is greater than the total number of structures\n",
                *low, *high );
        *high = maximum;
    }
    if( *high < *low ) {
        fprintf(stderr, "ERROR: Low range value (%d) is larger than high value (%d)\n",
                *low, *high );
        goto BadExpression;
    }
    return 1;
BadExpression:
    return 0;
}
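
The four range primitives accepted above (`*`, `N`, `N-M`, `N-*`) can be exercised in isolation. The following is a minimal sketch in standard C, assuming a hypothetical function name `parse_range_primitive`; unlike ParseRangeExpr it rejects trailing characters and prints no diagnostics, so it illustrates the primitive grammar rather than reproducing the routine.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one primitive: "*", "N", "N-M", or "N-*".
   On success stores 1-based bounds in *low and *high and returns 1.
   Returns 0 for malformed input, low < 1, or high < low.
   A high bound above maximum is clamped to maximum. */
static int parse_range_primitive(const char *expr, int maximum,
                                 int *low, int *high)
{
    char star, tail;
    assert(expr && low && high);
    if (strcmp(expr, "*") == 0) {
        *low = 1;
        *high = maximum;
    } else if (sscanf(expr, "%d-%d%c", low, high, &tail) == 2) {
        /* "N-M": both bounds were given, nothing follows them */
    } else if (sscanf(expr, "%d-%c%c", low, &star, &tail) == 2 && star == '*') {
        *high = maximum;          /* "N-*": open-ended upper bound */
    } else if (sscanf(expr, "%d%c", low, &tail) == 1) {
        *high = *low;             /* "N": a single id */
    } else {
        return 0;
    }
    if (*low < 1 || *high < *low) return 0;
    if (*high > maximum) *high = maximum;
    return 1;
}
```

The extra `%c` conversion in each format is a common way to detect trailing input: a fully consumed string leaves that conversion unassigned, so the returned count stays one short of the number of conversions.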

int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    char comline[2048];
    long startTime,
         totalTime,
         finishTime;

    /*
    *** Establish handler for a user interrupt.
    */
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if (!ParseRangeExpr(DatabaseRangeString, 999999999, &First, &Last))
        goto SyntaxError;
    First--; Last--;
    if (!OpenOutputFile()) goto FailureExit;
    time( &startTime );
    Visual((stderr,"Begin reading files: %s",ctime(&startTime)));
    /* Let's actually do something now */
    if (!ReadEverything()) goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Begin comparison: %s",ctime(&finishTime)));
    if (!UserAborted && !CompareEverything()) goto FailureExit;
    if (OutputFile) fclose(OutputFile);
    time( &finishTime );
    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;
    Visual((stderr, "Created %d Finger Prints in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));
    MakeComLine(comline, 2048, argc, argv);
    checkPointProgram(comline);
    Visual((stderr,"End Finger Print Computation: %s",ctime(&finishTime)));
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}

int ReadEverything()
{
    char *hold;
    char buff[256];
    int i;
    int j, offset, size;
    void *bitset=0;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */
    if (!MasterFile && !BitsetFile ) return 0;
    setbits_nbits_Init();
    TotalInputs = 1; /* no provision for concatenated */
    InputNames[0] = MasterFile ? MasterFile : BitsetFile;
    InputStartRec[0] = MasterRecord;
    if (MasterFile && !MasterRecord) InputStartRec[0] = 1 ;
    if (CheckPointFileName)
        OutputCheckpointNames[0] = CheckPointFileName;
    else
    {   sprintf(buff,"%s_%d_chk.bs",InputNames[0],0);
        OutputCheckpointNames[0] = UTL_STR_SAVE(buff);
    }
    nY_01 = nY_02 = 0;
    if (MasterFile)
    {   if ( !RetrieveMasterFile(InputNames[0],
                                 MasterFile_File ,
                                 InputStartRec[0],
                                 &(NumMissingBits[0]) ,
                                 &(BitsInAbsentiaNoCount[0]) ,
                                 &(CoreFileNames[0]) ,
                                 &(CoreStart[0]),
                                 &FngrFile,
                                 &(X1file[0]),
                                 &(X2file[0]),
                                 &(Y_01_Length[0]),
                                 &(Y_02_Length[0]),
                                 &fingerFP[0],
                                 &fingerOffsets[0],
                                 &ScreenFileName,
                                 &BytesPerFingerPrint,
                                 &WordsPerFingerprint,
                                 &query,
                                 &FingerCore_FP,
                                 &FingerCore_Card ) )
            goto UnableToReadMaster ; }
    else
    {
        if ( !( bitset = CS_PRDCT_BITSET_OPEN(InputNames[0],
                                              InputStartRec[0])) )
            goto UnableToReadBitset ;
        if ( !RetrieveMasterFileFromBitset(bitset,
                                 &(MasterFile_Bitset[0]) ,
                                 &(StartRec_Bitset[0]) ,
                                 &(NumMissingBits[0]) ,
                                 &(BitsInAbsentiaNoCount[0]) ,
                                 &(CoreFileNames[0]) ,
                                 &(CoreStart[0]),
                                 &FngrFile,
                                 &(X1file[0]),
                                 &(X2file[0]),
                                 &(Y_01_Length[0]),
                                 &(Y_02_Length[0]),
                                 &fingerFP[0],
                                 &fingerOffsets[0],
                                 &ScreenFileName,
                                 &BytesPerFingerPrint,
                                 &WordsPerFingerprint,
                                 &query,
                                 &FingerCore_FP,
                                 &FingerCore_Card ) )
            goto UnableToReadBitset ;
    }
    nY_01 += Y_01_Length[0] ;
    nY_02 += Y_02_Length[0] ;
    if (!WarmUp()) goto UnableToWarmUp;
    RemainingInput[0] = SomeLeft = Y_01_Length[0] * Y_02_Length[0] ;
    Pro_size = ( 31 + SomeLeft )/32 * 4;
    BitMapStartPoint[0] = 0;
    if (!Good_Products) /* initialize iff not already done */
    {   if (!(Good_Products = (int *) UTL_MEM_ALLOC(Pro_size))) return 0;
        memset( Good_Products, 0, Pro_size); }
    if (!Dead_Products) /* initialize iff not already done */
    {   if (!(Dead_Products = (int *) UTL_MEM_ALLOC(Pro_size))) return 0;
        memset( Dead_Products, 0, Pro_size);
        if (bitset) /* assumes actual sizes matches current sizes! */
        {   CS_PRDCT_BITSET_TO_RAW( bitset, Dead_Products, 0);
            not_here(Dead_Products, Pro_size );
        }
    }
    if (!(Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01))) return 0;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0;
    for (i=0;i<nY_01;i++)
    {
        if (!GetNextLine( cY_01+i, Y_01+i )) return 0;
    }
    if (!(Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02))) return 0;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0;
    for (i=0;i<nY_02;i++)
    {
        if (!GetNextLine( cY_02+i, Y_02+i )) return 0;
    }
    return 1;
UnableToWarmUp:
    fprintf(stderr, "Unable to Read screen file\n");
    return 0;
UnableToReadMaster:
    fprintf(stderr, "Unable to Read master file\n");
    return 0;
UnableToReadBitset:
    fprintf(stderr, "Unable to Read bitset file\n");
    return 0;
}

int WarmUp()
{
    FILE *fp;
    char *where_else, *name, *ext;
    int words;

    if (!(fp = fopen(ScreenFileName,"r")))
    {   where_else = UTL_FILE_PARSE(ScreenFileName,4);
        name = UTL_STR_CONCATENATE("sybylbase/tables/",where_else);
        UTL_MEM_FREE(where_else);
        ext = UTL_FILE_PARSE(ScreenFileName,5);
        where_else = UTL_FILE_COMPOSE_SPEC( "TA_ROOT", name, ext);
        if (!(fp = fopen(where_else,"r"))) return 0;
        UTL_MEM_FREE(where_else) ;
        UTL_MEM_FREE(name);
        UTL_MEM_FREE(ext) ;
    }
    ScreenStructure = (int *) DB_BIT2_PARSE_2DSCREEN(fp);
    fclose(fp); fp = 0;
    if (!ScreenStructure) return 0;
    CurrentInput = 0;
    if (mASCII) /* generate binary missing bits */
    {
        if ( (strlen(mASCII) / 8) != WordsPerFingerprint) return 0;
        if (!(MaskMissingBits = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint)))
            return 0;
        if (!(MaskQueryBits = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint)))
            return 0;
        for (words=0;words<WordsPerFingerprint;words++)
        {
            memcpy(next8, mASCII, 8);
            mASCII += 8;
            sscanf(next8,"%8x", MaskMissingBits + words);
        }
    }
    return 1;
}
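
The `-mask` handling above reads the hex string eight digits at a time, one group per fingerprint word. A self-contained sketch of that decoding step is given below, assuming a hypothetical function `decode_hex_mask` and `unsigned int` words of at least 32 bits. It is somewhat stricter than the listing: the length must match exactly and every character must be a hex digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Decode a hex string, 8 digits per word, into words[0..nwords-1].
   Returns 1 on success, 0 if the length is not exactly 8*nwords
   or if any character is not a hexadecimal digit. */
static int decode_hex_mask(const char *hex, unsigned int *words, size_t nwords)
{
    char group[9];
    size_t w, k;
    assert(hex && words);
    if (strlen(hex) != nwords * 8) return 0;
    for (w = 0; w < nwords; w++) {
        memcpy(group, hex + w * 8, 8);
        group[8] = '\0';                 /* terminate before scanning */
        for (k = 0; k < 8; k++)
            if (!isxdigit((unsigned char)group[k])) return 0;
        if (sscanf(group, "%8x", &words[w]) != 1) return 0;
    }
    return 1;
}
```

Copying each group into a terminated nine-byte buffer keeps `sscanf` from reading past the eight digits that belong to the current word.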

int MakeAFingerprint( sln, fingerPrint)
char *sln;
int *fingerPrint;
{
    struct CtConnectionTable *ct;
    int nBitsSet;

    if (!(ct = DB_IMPORT_SLN(sln))) return 0;
    memset( fingerPrint, 0, BytesPerFingerPrint );
    if( !DB_BIT2_EVALUATE( ct, ScreenStructure, fingerPrint, &nBitsSet ))
        return 0;
    return nBitsSet;
}

int GetNextLine( pCard, pFP)
int *pCard, **pFP;
{
    if (!(*pFP = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint))) return 0;
    if (!UTL_FILE_FREAD( pCard, sizeof(int), 1, fingerFP[0])) return 0;
    if (!UTL_FILE_FREAD( *pFP, sizeof(int), WordsPerFingerprint, fingerFP[0]))
        return 0;
    return 1;
}

int IntersectQuery( pIntr, pFP)
int *pIntr, **pFP;
{
    unsigned char *ptr, *qtr;
    int i, count;

    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for (count=0, i=0; i<WordsPerFingerprint*4; i++)
        count += nbits[ *ptr++ & *qtr++ ];
    *pIntr = count;
    return 1;
}

int CompareEverything()
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc, countinput;
    double max;

    countinput = 0;
    if ( !directQuery )
    {
        if (!InputSource) InputSourceFile = stdin;
        else
            if (!(InputSourceFile = fopen(InputSource,"r"))) return 0;
    }

    while ( directQuery ?
            ((fullQuery = directQuery) && countinput == 0) :
            (-1 != UTL_SCAN_GETS( InputSourceFile, "\n", &fullQuery)))
    {
        countinput++;
        if (!(c_query = MakeAFingerprint(fullQuery,query) )) return 0;
        if (MaskMissingBits) ReNumMissingBits(1);

        for (i=0;i<nY_01;i++)
            if (!IntersectQuery( iY_01+i, Y_01+i )) return 0;
        for (i=0;i<nY_02;i++)
            if (!IntersectQuery( iY_02+i, Y_02+i )) return 0;

        CurrentSlnId = 0;
        cqt = floor( (double) c_query / Tanimoto);
        q_lo = floor( (double) c_query * Tanimoto - (double) NumMissingBits[0]);
        q_hi = ceil( (double) ( c_query + NumMissingBits[0]) / Tanimoto);
        /* should convert test of Dead_Products to a "UTL_SET_NEXT" approach ?? */
        for (i=0;i<nY_01;i++)
        {
            if (CurrentSlnId > Last) break;
            if (cY_01[i] > cqt) { CurrentSlnId += nY_02; continue; }
            carhold = q_lo - cY_01[i];
            inthold = q_lo - iY_01[i];
            for (j=0;j<nY_02;j++)
            {
                if (UserAborted) return 1;
                if (CurrentSlnId > Last) break;
                if (CurrentSlnId < First) { CurrentSlnId++; continue; }
                if (cY_02[j] > cqt) { CurrentSlnId++; continue; }
                if (cY_02[j] < carhold) { CurrentSlnId++; continue; }
                if (inthold > iY_02[j]) { CurrentSlnId++; continue; }
                if (TestDead(0,CurrentSlnId)) { CurrentSlnId++; continue; }
                ActuallyCompute( i, j, &onion, &intsc, &max);
                if (max >= Tanimoto)
                {
                    OutputThisHit(i, j, onion, intsc, max);
                    nProcessed++;
                    if (nProcessed >= NoMorehitsPlease) return 1;
                }
                CurrentSlnId++;
            } /* Y_02 loop */
        } /* Y_01 loop */
    } /* while still queries left */
    return 1;
}
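
The cardinality test on cqt in CompareEverything rests on a simple bound. For a query Q and a product P, the Tanimoto coefficient |Q and P| / |Q or P| cannot exceed |Q| / |P|, and because the product fingerprint is formed by OR-ing the reagent fingerprints, |P| is at least the cardinality of either reagent. A reagent with more than floor(|Q| / threshold) bits set therefore cannot take part in a hit. The sketch below restates that test under these assumptions; max_useful_cardinality and can_skip_reagent are hypothetical names, not functions of the listing.

```c
#include <assert.h>

/* Largest fingerprint cardinality that can still reach `threshold`
 * against a query with `query_card` bits set (the role of cqt above).
 * The cast truncates, which equals floor() for non-negative values. */
static int max_useful_cardinality(int query_card, double threshold)
{
    assert(query_card >= 0);
    assert(threshold > 0.0 && threshold <= 1.0);
    return (int) ((double) query_card / threshold);
}

/* Returns 1 when a reagent with `reagent_card` bits set cannot be part of
 * any product that reaches the threshold, so its whole row can be skipped. */
static int can_skip_reagent(int reagent_card, int query_card, double threshold)
{
    return reagent_card > max_useful_cardinality(query_card, threshold);
}
```

Skipping a first-position reagent this way avoids the entire inner loop over second-position reagents, which is why the listing advances CurrentSlnId by nY_02 in that case.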

int ReNumMissingBits( int howmany )
{
    for ( ; howmany ; howmany--)
        ReNum(MaskMissingBits, query, WordsPerFingerprint,
              &(NumMissingBits[howmany-1]));
}

int ReNum(int *mask, int *query, int len, int *missed)
{
    unsigned char *one, *two;
    unsigned char *masq;

    masq = (unsigned char *) MaskQueryBits;
    one = (unsigned char *) mask;
    two = (unsigned char *) query;
    *missed = 0;
    len *= 4;
    for ( ; len ; len--) *missed += nbits[ (*masq++ = *one++ & *two++) ];
    return 1;
}
int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i, product;
    unsigned char *h1, *h2, *hquery, *masq;
    int nuMissing;

    if (DebugLevel)
        fprintf( stderr," ActuallyCompute at %d , %d\n", index1, index2);

    h1 = (unsigned char *) Y_01[index1];
    h2 = (unsigned char *) Y_02[index2];
    hquery = (unsigned char *) query;
    *pUnion = *pIntersection = 0;
    if (mASCII) {nuMissing = 0; masq = (unsigned char *) MaskQueryBits; }
    else {nuMissing = NumMissingBits[0];}
    for ( i=0; i<WordsPerFingerprint*4; i++)
    {
        product = *h1++ | *h2++ ;
        *pUnion += nbits[ product | *hquery];
        if (mASCII)
            nuMissing += nbits[ ~product & *masq++];
        *pIntersection += nbits[ product & *hquery++];
    }
    if (DebugLevel > 9) fprintf(stderr,"%d / %d %6.3f\n",
                                *pIntersection, *pUnion,
                                (double) *pIntersection / *pUnion);
    return (*pMaxTan = (double) (*pIntersection + nuMissing) / (double) *pUnion);
}

int OutputThisHit( index1, index2, onion, intsc, maxtan)
int index1, index2, onion, intsc;
double maxtan;
{
    if (OutputFile)
        fprintf(OutputFile,"%6d %6d %5.3f %5.3f\n", index1+1, index2+1,
                (double) intsc / (double) onion,
                maxtan);

    /* just note in bitset as a hit */
    FlagProduct(Good_Products, index1, index2, 0);
    return 1;
}
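
The two similarity values written for each hit can be computed from three counts: the bits shared by the query and the combined reagent fingerprint, the bits set in either, and the number of query bits that might still be present in the real product but are not represented by the reagents. The following sketch mirrors the arithmetic of ActuallyCompute and OutputThisHit under that reading; tanimoto and tanimoto_max are hypothetical helper names.

```c
#include <assert.h>

/* Apparent Tanimoto coefficient: shared bits over bits set in either. */
static double tanimoto(int intersection, int union_count)
{
    assert(union_count > 0);
    return (double) intersection / (double) union_count;
}

/* Upper bound on the coefficient when up to `missing` additional query
 * bits could turn out to be set in the real product fingerprint. */
static double tanimoto_max(int intersection, int union_count, int missing)
{
    assert(union_count > 0);
    return (double) (intersection + missing) / (double) union_count;
}
```

A product is reported when the upper bound, not the apparent value, reaches the threshold, so no true hit is lost because of unrepresented bits.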

static int not_here( what, nbytes )
unsigned char *what;
int nbytes;
{
    /* complement every byte in place */
    for ( ; nbytes; --nbytes, ++what) *what = ~*what;
    return 1;
}

/* this belongs in the utl module, actually */ 

int MakeComLine( char *line, int len, int argc, char **argv)
{
    int i;

    sprintf(line,"%s ",argv[0]);
    for (i = 1; i<argc; i++)
    {
        line += strlen(line);
        sprintf(line,"%s ",argv[i]);
    }
}

CheckPointProgram(programName)
char *programName ;
{
    int sizes[2] , size;
    int allocSizes[2] ;
    int numInSites[2] ;
    char hold[81] ;
    int i ;
    void *compressed ;
    int total ;

    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        sizes[0] = Y_01_Length[i] ;
        sizes[1] = Y_02_Length[i] ;
        numInSites[0] = numInSites[1] = -1 ;
        allocSizes[0] = allocSizes[1] = -1 ; /* should keep bitset
                                                allocSizes if present?*/
        compressed = NIL;
        total = 0;

        WriteOutCheckPointFile(OutputCheckpointNames[i],
                               MasterFile ? InputNames[i]
                                          : MasterFile_Bitset[i],
                               MasterFile ? InputStartRec[i]
                                          : StartRec_Bitset[i],
                               programName,
                               Good_Products,
                               BitMapStartPoint[i],
                               2,
                               sizes,
                               allocSizes,
                               Selections[i],
                               numInSites,
                               total,
                               compressed);
Appendix "I"



/*+C
 * dbcslnquickselect
 *
 * This program evaluates (approximate) Tanimoto 2D similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups.
 *
 * ToDo:
 * Following ADS group suggestions, order the reagent fp by size (fpcard).
 * To be added: restart capability and reagent blackout.
 *
 * The input files, one per Xi, have one line per
 * structure and contain the elements "fpcard=xxx;" and "fp=zzz;" where
 * the terminating ';' may also be '>'. The integer value of fpcard is
 * the cardinality of the fingerprint; the hex value of fp is the
 * fingerprint bitstring as two ascii bytes per bitset byte.
 *
 * Queries are input as SLN repeatedly from stdin; ending on ^D or X
 *
 * The resultant file contains one line per hit, of the form
 *     Y1 Y2 T Tmax
 * where Y1   = index of the substituent in X1.pro file
 *       Y2   = index of the substituent in X2.pro file
 *       T    = apparent Tanimoto similarity
 *       Tmax = maximum possible Tanimoto, given the slop bits (see below)
 *
 * dbcslnquickselect -prefix <name> -Tanimoto <real> -prefer <what> -append
 *                   -slop <int> -maxhits <int> -output <name> +debug
 *
 * Options:
 *
 * -prefix name      - name is the prefix for a set of 2 files
 *                     with extensions .X1.pro .X2.pro ; files have fingerprints
 *                     (someday) will reload from prefix.RELOAD if present
 *
 * -Tanimoto tan     - tan is a Tanimoto similarity 0.0 - 1.0
 *                     (default is 0.85)
 *
 * -prefer           - one of R1,R2 else random. R1 maximizes use of R1
 *
 * -slop bitcount    - bitcount is the number of bits in the
 *                     product fingerprint that may not be
 *                     represented by ORing X1 X2 (default 0)
 *
 * -maxhits max      - stop when max hits are found (default infinity)
 *
 * -output filename  - specifies the output file for the hit info
 *                     by default results are sent to stdout.
 *
 * -append           - append results to an existing output file
 *                     By default an output file is overwritten.
 *
 * +debug            - writes irrelevant info to stderr
 *
 * -rangevar         - List of field names and ranges to filter
 *                     the final list with.
 *
 * -oneof            - List of field names and values that the product
 *                     should match in order to be considered.
 *
 *                     This flag forces the display of all
 *                     options
 */

/* use 3db
 * dbcc dbcslnquickselect.c -o dbcslnquickselect */



#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"

#define GoodExit 0
#define ErrorExit 1

#define Visual(s) { fprintf s; }

#define ALLOCATE_INCREMENT 5
#define MISSING_FLOAT_VALUE -100000000.00
#define MISSING_INT_VALUE -1
#define NOT_A_MATCH_VALUE -2

#define SMALL_FLOAT



/*
** Command line argument -rangevar and -oneof are kept here.
*/
static char *RangeVar ;
static char *OneOfVar ;

/*
** Structure to hold the field name (inside the nnn.x? files) and the allowed
** range for that field.
*/
typedef struct RangeStruct
{
    char *RangeFieldName ;
    float lowValue ;
    float highValue ;
} RangeStruct ;

int NumRangeFields ;
int NumRangeFieldsAllocated ;
RangeStruct *RangeFields ;

/*
** Structure to hold the field name and a list of values for the selection
** type fields.
*/
typedef struct OneOfStruct
{
    char *OneOfFieldName ;
    int numValues ;
    int numValuesAlloc ;
    char **values ;
} OneOfStruct ;

int NumOneOfFieldsAllocated ;
int NumOneOfFields ;
OneOfStruct *OneOfValues ;

float **RangeValues_Y01 ; /* Actual values read in from nnn.X1 file.
                             If MW is the first and logp is the second value
                             specified on the -rangevar argument list then
                             RangeValues_Y01[n][0] would keep the value for MW
                             for the nth line in the nnn.X1 file and
                             RangeValues_Y01[n][1] would keep the value for
                             logp for that line */
float **RangeValues_Y02 ; /* same */

int **OneOfValues_Y01 ;   /* Actual values read from nnn.X1 files but translated
                             into an index of OneOfValues[i].values so
                             we dont have to waste memory and time doing strcmp */
int **OneOfValues_Y02 ;   /* Same */

static FILE *OutputFile;
static char *OutputFileName;
static char *WhatFirst;
static int What1 = -1;
static int What2;
static char *PrefixForFiles;
static char *InputSource = 0;
static FILE *InputSourceFile;

/* Code presumes that an int is 32 bits, ASCII-ed into %.8x format */

static int **Y_01;          /* fingerprints */
static int **Y_02;          /* " */
static int *query;          /* " */
static int nY_01;           /* number of structures */
static int nY_02;           /* " */
static int *cY_01;          /* cardinality of fingerprints */
static int *cY_02;          /* " */
static int c_query;         /* " */
static int *iY_01;          /* intersection count of fprints */
static int *iY_02;          /* " */

static int *Good_1;
static int *Good_2;
static int *Dead_1;
static int *Dead_2;
static int *Good_Products;
static int *Dead_Products;

static int nbits[256];
static int setbits[8];
static double Tanimoto = 0.85;
static int BitsInAbsentia = 0;
static int AppendToOutputFile = 0;
static int WordsPerFingerprint = 0;
static int BytesPerFingerPrint = 0;
static int NoMorehitsPlease = 999999999;

static int DebugLevel = 0 ;
static int UserAborted;

static int nProcessed = 0;
static int SomeLeft;
static char next8[10] = "01234567\0";



static
struct ParseOptions Options[] = {
/***
 *** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
 ***/
    {"prefix", ParseOptString, &PrefixForFiles,
     "Prefix for all input files" },
    {"Tanimoto", ParseOptDouble, &Tanimoto,
     "Similarity threshold (0.0 to 1.0)" },
    {"slop", ParseOptInt, &BitsInAbsentia,
     "Number of potentially missing bits in product fp" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
     "Maximum number of hits before stopping" },
    {"input", ParseOptString, &InputSource,
     "File from which queries will be read (default stdin)." },
    {"output", ParseOptString, &OutputFileName,
     "File to which hit info will be written. "},
    {"prefer", ParseOptString, &WhatFirst,
     "One of R1, R2 to maximize use of."},
    {"append", ParseOptNoArg, &AppendToOutputFile,
     "Use -append to append results to an existing file" },
    {"debug", ParseOptBoolean, &DebugLevel,
     "Use +debug to enable debugging messages" },
    {"rangevar", ParseOptString, &RangeVar,
     "Scalar field name and range to filter out, i.e. logp -1.0 8.0 MW 200 500 price 0 12.50" },
    {"oneof", ParseOptString, &OneOfVar,
     "Field name and list of values that the product should match\n, i.e. supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" },
};

int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }
int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s); s++;}}

static void UserHitControlC()
/*
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 */
{
    UserAborted = 1;
}



/*
** Abstract         : Function parses range field string for ADS design programs.
**                    It takes a string of the form
**                    "logp -1.0 8.0 MW 200 500 price 0 12.50" and fills in the
**                    global array RangeFields.
** Usage            :
** Returns          : 1 on success, 0 for failure.
** Algorithms       : None.
** Revision History :
**-E:
*/
int ParseRangeVar(rangeVar,numRangeFieldsAllocated,numRangeFields,rangeFields)
char *rangeVar ;
int *numRangeFieldsAllocated ;
int *numRangeFields ;
struct RangeStruct **rangeFields;
{
    static int stat = 0 ;
    char *buffer = (char *)NULL ;
    char *name ;
    char *low ;
    char *high ;
    int i ;

    *numRangeFieldsAllocated = 0 ;
    *numRangeFields = 0 ;
    *rangeFields = (struct RangeStruct *)NULL ;

    if ( !(buffer = UTL_STR_SAVE(RangeVar)) )
        goto Failure ;

    name = strtok(buffer," ");
    while ( name )
    {
        if ( !(low = strtok(NULL," ")) )
            goto UnableToParse ;
        if ( !(high = strtok(NULL," ")) )
            goto UnableToParse ;
        if ( *numRangeFields >= *numRangeFieldsAllocated )
        {
            if ( !*rangeFields )
            {
                if (!(*rangeFields = (struct RangeStruct *)UTL_MEM_CALLOC(
                                         ALLOCATE_INCREMENT,
                                         sizeof(struct RangeStruct))))
                    goto Failure ;
                else
                    *numRangeFieldsAllocated = ALLOCATE_INCREMENT ;
            }
            else
            {
                if (!( *rangeFields = (struct RangeStruct *)UTL_MEM_RECALLOC(
                           RangeFields,
                           (*numRangeFieldsAllocated * sizeof(struct RangeStruct)),
                           ((*numRangeFieldsAllocated + ALLOCATE_INCREMENT) *
                            sizeof(struct RangeStruct)) )) )
                    goto Failure ;
                else
                    *numRangeFieldsAllocated += ALLOCATE_INCREMENT ;
            }
        }
        RangeFields[*numRangeFields].RangeFieldName = UTL_STR_SAVE(name) ;
        RangeFields[*numRangeFields].lowValue = atof(low);
        RangeFields[*numRangeFields].highValue = atof(high);
        (*numRangeFields)++ ;
        name = strtok(NULL," ");
    }

    if (DebugLevel)
    {
        for ( i = 0 ; i < *numRangeFields ; i++ )
        {
            fprintf(stderr,"\n%s %f - %f",
                    RangeFields[i].RangeFieldName,
                    RangeFields[i].lowValue,
                    RangeFields[i].highValue) ;
        }
    }
    stat = 1 ;
    goto Cleanup ;

UnableToParse:
    fprintf(stderr,"Unable to parse -rangevar %s\n",RangeVar);
    stat = 0 ;
    goto Cleanup ;
Failure :
    stat = 0 ;
    goto Cleanup ;
Cleanup :
    if ( buffer )
        UTL_MEM_FREE(buffer) ;
    return stat ;
}
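
The token-by-token parse performed by ParseRangeVar can be tried out without the UTL_ memory routines. The sketch below is an illustration only: it assumes a fixed-capacity array instead of the incremental reallocation of the listing, and range_field, MAX_RANGES and parse_ranges are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_RANGES 8   /* hypothetical fixed capacity for this sketch */

struct range_field {
    char name[32];
    double low;
    double high;
};

/* Parse "name low high name low high ..." into out[].  Returns the number
 * of complete triples, or -1 if a triple is incomplete, a name is too
 * long, or more than `max` triples are present. */
static int parse_ranges(const char *spec, struct range_field *out, int max)
{
    char buffer[256];
    char *name, *low, *high;
    int n = 0;

    if (strlen(spec) >= sizeof buffer)
        return -1;
    strcpy(buffer, spec);              /* strtok modifies its input */

    for (name = strtok(buffer, " "); name; name = strtok(NULL, " ")) {
        low = strtok(NULL, " ");
        high = strtok(NULL, " ");
        if (!low || !high || n >= max || strlen(name) >= sizeof out[n].name)
            return -1;
        strcpy(out[n].name, name);
        out[n].low = atof(low);
        out[n].high = atof(high);
        n++;
    }
    return n;
}
```

As in the listing, the input string is copied before tokenizing because strtok writes terminators into the buffer it scans.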

/*
** Abstract         :
**                    It takes a string of the form
** Usage            :
** Returns          : 1 on success, 0 for failure.
** Algorithms       : None.
** Revision History :
*/
static int
ParseOneOfVar(oneOfVar,numOneOfFieldsAllocated,numOneOfFields,oneOfValues)
char *oneOfVar ;
int *numOneOfFieldsAllocated ;
int *numOneOfFields ;
struct OneOfStruct **oneOfValues;
{
    static int stat = 0 ;
    char *buffer = (char *)NULL ;
    char *name ;
    char *choices ;
    char *choice ;
    int i ;
    int j ;
    char *cp ;
    char *end ;

    *numOneOfFieldsAllocated = 0 ;
    *numOneOfFields = 0 ;
    (*oneOfValues) = (struct OneOfStruct *)NULL ;
    if ( !(buffer = UTL_STR_SAVE(OneOfVar)) )
        goto Failure ;

    /*
    ** Start off by reading the field name.
    */
    name = strtok(buffer," ");
    while ( name )
    {
        if ( *numOneOfFields >= *numOneOfFieldsAllocated )
        {
            if ( !(*oneOfValues) )
            {
                if (!(*oneOfValues = (struct OneOfStruct *)UTL_MEM_CALLOC(
                                         ALLOCATE_INCREMENT,
                                         sizeof(struct OneOfStruct))))
                    goto Failure ;
                else
                    *numOneOfFieldsAllocated = ALLOCATE_INCREMENT ;
            }
            else
            {
                if (!( *oneOfValues = (struct OneOfStruct *)UTL_MEM_RECALLOC(
                           *oneOfValues,
                           (*numOneOfFieldsAllocated * sizeof(struct OneOfStruct)),
                           ((*numOneOfFieldsAllocated + ALLOCATE_INCREMENT) *
                            sizeof(struct OneOfStruct)) )) )
                    goto Failure ;
                else
                    *numOneOfFieldsAllocated += ALLOCATE_INCREMENT ;
            }
        }
        (*oneOfValues)[*numOneOfFields].OneOfFieldName = UTL_STR_SAVE(name);
        (*oneOfValues)[*numOneOfFields].numValues = 0 ;
        (*oneOfValues)[*numOneOfFields].numValuesAlloc = ALLOCATE_INCREMENT ;
        if ( !((*oneOfValues)[*numOneOfFields].values = (char **)
                   UTL_MEM_CALLOC(ALLOCATE_INCREMENT, sizeof(char *)) ) )
            goto Failure ;

        /*
        ** Now look at the choices this field could have.
        */
        choices = strtok(NULL," ");
        if ( !choices )
            goto UnableToParse ;
        choice = strtok(choices,",");
        while ( choice )
        {
            if ( (*oneOfValues)[*numOneOfFields].numValues >=
                 (*oneOfValues)[*numOneOfFields].numValuesAlloc )
            {
                if ( !((*oneOfValues)[*numOneOfFields].values = (char **)
                           UTL_MEM_RECALLOC((*oneOfValues)[*numOneOfFields].values,
                               ((*oneOfValues)[*numOneOfFields].numValuesAlloc *
                                sizeof(char *)),
                               (((*oneOfValues)[*numOneOfFields].numValuesAlloc +
                                 ALLOCATE_INCREMENT ) * sizeof(char *)) ) ))
                    goto Failure ;
                (*oneOfValues)[*numOneOfFields].numValuesAlloc += ALLOCATE_INCREMENT;
            }
            (*oneOfValues)[*numOneOfFields].values[(*oneOfValues)[*numOneOfFields].numValues]
                = UTL_STR_SAVE(choice);
            (*oneOfValues)[*numOneOfFields].numValues++ ;
            end = choice + strlen(choice) + 1 ;
            choice = strtok(NULL,",");
        }
        (*numOneOfFields)++ ;
        name = strtok(end," ");
    }

    if (DebugLevel)
    {
        for ( i = 0 ; i < *numOneOfFields ; i++ )
        {
            fprintf(stderr,"\n%s",(*oneOfValues)[i].OneOfFieldName) ;
            for ( j = 0 ; j < (*oneOfValues)[i].numValues ; j++ )
                fprintf(stderr,"\n %s",(*oneOfValues)[i].values[j]);
        }
        fprintf(stderr,"\n");
    }
    stat = 1 ;
    goto Cleanup ;

UnableToParse:
    fprintf(stderr,"Unable to parse -oneof %s\n",OneOfVar);
    stat = 0 ;
    goto Cleanup ;
Failure :
    stat = 0 ;
    goto Cleanup ;
Cleanup :
    if ( buffer )
        UTL_MEM_FREE(buffer) ;
    return stat ;
}



/*
**+E:
** Abstract         : Function parses a line from the input file and extracts
**                    out any rangevar or oneof fields.
** Usage            :
** Returns          : Always returns 1 ;
** Algorithms       : None.
** Revision History :
**-E:
*/
int ReadLineAttributes(line,numRangeFields,rangeValues,rangeFields,numOneOfFields,
                       oneOfValues,oneOfFields)
char *line ;
int numRangeFields ;
float **rangeValues;
struct RangeStruct *rangeFields;
int numOneOfFields ;
int **oneOfValues;
struct OneOfStruct *oneOfFields;
{
    int i ;
    int j ;
    char *cp ;

    /*
    ** Now read in the scalar selection fields if any
    */
    if ( numRangeFields )
    {
        if ( !(*rangeValues = (float *)UTL_MEM_CALLOC(numRangeFields,
                                                      sizeof(float)) ) )
            return 0 ;
    }
    if ( numOneOfFields )
    {
        if ( !(*oneOfValues = (int *)UTL_MEM_CALLOC(numOneOfFields,
                                                    sizeof(int)) ) )
            return 0 ;
    }
    for ( i = 0 ; i < numRangeFields ; i++ )
    {
        if ( ( cp = strstr(line,rangeFields[i].RangeFieldName ) ) )
        {
            /*
            ** Move past the logp= to get the value of this field, if the value is
            ** a ';' then it is a missing value.
            */
            cp = cp + strlen(rangeFields[i].RangeFieldName) + 1 ;
            if ( *cp == ';' )
                (*rangeValues)[i] = MISSING_FLOAT_VALUE ;
            else
                (*rangeValues)[i] = atof(cp);
        }
        else
        {
            (*rangeValues)[i] = MISSING_FLOAT_VALUE ;
        }
    }

    /*
    ** Parse the -oneof field, we are looking for something looking like
    ** "supplier=Aldrich"
    */
    for ( i = 0 ; i < numOneOfFields ; i++ )
    {
        if ( ( cp = strstr(line,oneOfFields[i].OneOfFieldName ) ) )
        {
            cp = cp + strlen(oneOfFields[i].OneOfFieldName) + 1 ;
            if ( *cp == ';' )
                (*oneOfValues)[i] = MISSING_INT_VALUE ;
            else
            {
                for ( j = 0 ; j < OneOfValues[i].numValues ; j++ )
                {
                    if ( UTL_STR_NCMP_NOCASE(cp,
                                             oneOfFields[i].values[j],
                                             strlen(oneOfFields[i].values[j])) == 0 )
                    {
                        (*oneOfValues)[i] = j ;
                        break;
                    }
                }
                if ( j == oneOfFields[i].numValues )
                    (*oneOfValues)[i] = NOT_A_MATCH_VALUE ;
            }
        }
        else
            (*oneOfValues)[i] = MISSING_INT_VALUE ;
    }
    return 1 ;
}

/*

**+E:
** Abstract         : Function checks to see if the given product passes the
**                    user supplied filters.
** Usage            :
** Returns          : 1 if the product is not within range, 0 otherwise.
** Algorithms       : None.
** Revision History :
**-E:
*/
static int
NotWithinScalarRange(firstIndex,secondIndex,numRangeFields,rangeValues_Y01,
                     rangeValues_Y02,rangeFields,numOneOfFields,
                     oneOfValues_Y01,oneOfValues_Y02,oneOfValues)
int firstIndex ;  /* Index into Y_01 data */
int secondIndex ; /* Index into Y_02 data */
int numRangeFields ;
float **rangeValues_Y01 ;
float **rangeValues_Y02 ;
struct RangeStruct *rangeFields;
int numOneOfFields ;
int **oneOfValues_Y01 ;
int **oneOfValues_Y02 ;
struct OneOfStruct *oneOfValues ;
{
    int i ;
    float total ;

    /*
    ** First check the range values.
    */
    for ( i = 0 ; i < numRangeFields ; i++ )
    {
        /*
        ** If one of the regions has a missing value, then we do not filter this
        ** product.
        */
        if ((( rangeValues_Y01[firstIndex][i] - MISSING_FLOAT_VALUE )
                 == SMALL_FLOAT) ||
            (( rangeValues_Y02[secondIndex][i] - MISSING_FLOAT_VALUE )
                 == SMALL_FLOAT ) )
            return 0 ;
        total = rangeValues_Y01[firstIndex][i] + rangeValues_Y02[secondIndex][i];
        if ((total > rangeFields[i].highValue ) ||
            (total < rangeFields[i].lowValue ) )
        {
            return 1 ;
        }
    }

    for ( i = 0 ; i < numOneOfFields ; i++ )
    {
        /*
        ** If the value is missing then we dont mess with this guy.
        */
        if ( ( oneOfValues_Y01[firstIndex][i] == MISSING_INT_VALUE ) ||
             ( oneOfValues_Y02[secondIndex][i] == MISSING_INT_VALUE ) )
            return 0 ;
        /*
        ** If any of the regions in the product does not match the selection
        ** criteria, then the product is rejected.
        */
        if ( ( oneOfValues_Y01[firstIndex][i] == NOT_A_MATCH_VALUE ) ||
             ( oneOfValues_Y02[secondIndex][i] == NOT_A_MATCH_VALUE ) )
            return 1 ;
    }

    return 0 ;
}
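
The range test in NotWithinScalarRange treats a scalar property of the product as the sum of the values of its two reagents and rejects the product when that sum leaves the user-supplied interval. A minimal sketch of just that comparison follows; outside_range is a hypothetical name, and the handling of missing values in the listing is deliberately left out.

```c
#include <assert.h>

/* Returns 1 when the summed property of the two reagents falls outside
 * [low, high], and 0 when the product passes the filter. */
static int outside_range(double value1, double value2, double low, double high)
{
    double total = value1 + value2;

    return (total > high) || (total < low);
}
```

For an option such as "MW 200 500", reagents of weight 150 and 200 give a total of 350 and pass, while reagents of 300 and 250 give 550 and are filtered out.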



static int ParseArguments( argc, argv )
/*+I
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    OutputFile = stdout;
    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if ( !nargs ) goto SyntaxError;

    if (WhatFirst)
    {
        if (strstr(WhatFirst,"R1")) WhatFirst[0] = '1';
        if (strstr(WhatFirst,"R2")) WhatFirst[0] = '2';
    } else {
        WhatFirst = UTL_MEM_ALLOC(2); WhatFirst[0] = '0'; }
    if ( RangeVar &&
         !ParseRangeVar(RangeVar,&NumRangeFieldsAllocated,&NumRangeFields,&RangeFields))
        goto SyntaxError ;
    if ( OneOfVar &&
         !ParseOneOfVar(OneOfVar,&NumOneOfFieldsAllocated,&NumOneOfFields,&OneOfValues))
        goto SyntaxError ;
    return 1;

SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*+I
 * Returns: 1 on success, else 0
 */
{
    char *msg;
    FILE *fp;

    OutputFile = stdout;
    if ( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) the file to output to exists
            **     Use the ownership of the current owner of the file or if you cant do that
            **     do not do anything.
            ** (2) The file is being created.
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            { /* If the file exist and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            { /* Create the file as the REAL user */
                seteuid(uid);
            }
        }
        OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb"));

        if ( !OutputFile ) {
            fprintf(stderr, "Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }

    return 1;

ErrorReturn:
    return 0;
}



static void CloseOutputFile()
/*+I
 * This function closes the output file. It is included just for cleanliness.
 */
{
    fclose( OutputFile );
}



int main( argc, argv )
/*+E
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
    int numFiltered = 0 ;

    /***
     *** Establish handler for a user interrupt.
     ***/
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    if ( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if ( !OpenOutputFile() ) goto FailureExit;
    /* if (!RestartState()) goto FailureExit; */
    time( &startTime );
    Visual((stderr, "Begin reading files: %s",ctime(&startTime)));
    /* Let's actually do something now */
    if (!ReadEverything()) goto FailureExit;
    time( &finishTime );
    Visual((stderr, "Begin filtering: %s",ctime(&finishTime)));
#if 0
    DumpBitSet(Good_Products,nY_01,nY_02);
    DumpBitSet(Dead_Products,nY_01,nY_02);
#endif
    if (!FilterProducts(&numFiltered))
        goto FailureExit;
#if 0
    DumpBitSet(Dead_Products,nY_01,nY_02);
#endif
    time( &finishTime );

    Visual((stderr,"Filtered out %d out of %d possible products\n",numFiltered, nY_02
            * nY_01 ));
    Visual((stderr, "Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted && !SelectEverything()) goto FailureExit;

    CloseOutputFile();
    time( &finishTime );

    totalTime = finishTime - startTime;
    if ( !totalTime ) totalTime = 1;

    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime % 60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));

    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));
    UserAborted ? exit(ErrorExit) : exit(GoodExit);

SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}



int ReadEverything()
{
    char *hold;
    int i;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */
    if (!PrefixForFiles ) return 0;
    if (!WarmUp()) return 0;
    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X1.pro"))) return 0;
    if (!(InputSourceFile = fopen(hold,"r"))) return 0;
    if (!(nY_01 = CountLines())) return 0;
    if (!(Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01))) return 0;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01))) return 0;
    if ( NumRangeFields )
    {
        if (!(RangeValues_Y01 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_01)))
            return 0;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
            return 0;
    }

    for (i=0;i<nY_01;i++)
    {
        if (!GetNextLine( cY_01+i, Y_01+i, RangeValues_Y01 + i, OneOfValues_Y01 + i ))
            return 0;
    }
    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X2.pro"))) return 0;
    if (!(InputSourceFile = fopen(hold,"r"))) return 0;
    if (!(nY_02 = CountLines())) return 0;
    if (!(Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02))) return 0;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02))) return 0;
    if ( NumRangeFields )
    {
        if (!(RangeValues_Y02 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_02)))
            return 0;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
            return 0;
    }
    for (i=0;i<nY_02;i++)
    {
        if (!GetNextLine( cY_02+i, Y_02+i, RangeValues_Y02 + i, OneOfValues_Y02 + i ))
            return 0;
    }
    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!Good_1) /* not reloaded */
    {
        i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);
        i = (nY_02+31)/32 * 4;
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);
        i = (nY_01*nY_02+31)/32 * 4;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Good_Products,0,i);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Dead_Products,0,i);
        SomeLeft = nY_01 * nY_02;
    }
    return 1;
}

int WarmUp()
{
    int i;

    for (i=0;i<256;i++) nbits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
                                   (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128 ;
    for (i=0;i<8;i++) setbits[i] = ( 1 << i) & 255;
    return 1;
}

int CountLines()
{
    int i;
    char *foo;

    i = 0;
    while ( -1 != UTL_SCAN_GETS( InputSourceFile, "\n", &foo)) i++;
    rewind(InputSourceFile);
    return i;
}



static int FilterProducts(numFiltered)
int *numFiltered ;
{
    int numProducts ;
    int i ;
    int index1 ;
    int index2 ;

    *numFiltered = 0 ;

    numProducts = nY_02 * nY_01 ;
    for ( i = 0 ; i < numProducts ; i++ )
    {
        index1 = i / nY_02 ; /* Y_01 index */
        index2 = i % nY_02 ; /* Y_02 index */

        if ( NotWithinScalarRange(index1,
                                  index2,
                                  NumRangeFields,
                                  RangeValues_Y01,
                                  RangeValues_Y02,
                                  RangeFields,
                                  NumOneOfFields,
                                  OneOfValues_Y01,
                                  OneOfValues_Y02,
                                  OneOfValues ))
        {
            FlagProduct(Dead_Products, 0, 0, i) ;
            SomeLeft--;
            *numFiltered += 1 ;
            if (DebugLevel)
                fprintf(stderr, "Filtered %d %d %d\n", i+1, index1+1, index2+1 );
        }
    }
    return 1 ;
}

/*
** Abstract         : Function will read the next line pointed to by global
**                    InputSourceFile and parses out finger print and any other
**                    scalar attributes we are filtering on.
** Usage            :
** Returns          : 1 on success, 0 for failure.
** Algorithms       : None
** Revision History :
*/
int GetNextLine( pCard, pFP, rangeValues, oneOfValues )
int *pCard;          /*(OUT) returns the cardinality of the finger print */
int **pFP;           /*(OUT) returns the finger print */
float **rangeValues; /*(OUT) */
int **oneOfValues;   /*(OUT) */
{
    char *line;
    int words;
    int i ;
    int j ;
    char *cp ;

    if (-1 == UTL_SCAN_GETS( InputSourceFile, "\n", &line)) return 0;

    ReadLineAttributes(line,
                       NumRangeFields,
                       rangeValues,
                       RangeFields,
                       NumOneOfFields,
                       oneOfValues,
                       OneOfValues) ;

    line = strstr(line,"fpcard=") + strlen("fpcard=");
    if (!UTL_STR_EXTRACT_INT(line, pCard)) return 0;
    line = strstr(line,"fp=") + strlen("fp=");
    UTL_SCAN_TOKENIZE(line,';','\\');
    UTL_SCAN_TOKENIZE(line,'>','\\');
    words = strlen(line) / 8; /* must have 32 bit int multiple */
    if (!WordsPerFingerprint)
    {
        BytesPerFingerPrint = words*4;
        query = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint);
        WordsPerFingerprint = words;
    }
    if ( words != WordsPerFingerprint) return 0;
    *pFP = (int *) UTL_MEM_ALLOC(words * sizeof(int));
    for (words=0; words<WordsPerFingerprint; words++)
    {
        memcpy(next8, line, 8);
        line += 8;
        sscanf(next8,"%8x", *pFP + words);
    }
    return 1;
}



int IntersectQuery( pIntr, pFP )
int *pIntr, **pFP;
{
    unsigned char *ptr, *qtr;
    int i, count;

    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for (count=0, i=0; i<WordsPerFingerprint*4; i++)
        count += nbits[ *ptr++ & *qtr++ ];

    *pIntr = count;
    return 1;
}
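
The byte-wise intersection count used by IntersectQuery can be illustrated in isolation. The following is a minimal sketch, not the listing's own code: the names bit_count_table, init_bit_count_table and count_common_bits are hypothetical stand-ins for the nbits[] table and the loop above, and the fingerprints are passed as plain byte arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the listing's nbits[] table:
   entry v holds the number of 1 bits in the byte value v. */
static int bit_count_table[256];

/* Fill the table once before any counting is done. */
static void init_bit_count_table(void)
{
    int v;
    for (v = 0; v < 256; v++) {
        int bits = 0, x = v;
        while (x) { bits += x & 1; x >>= 1; }
        bit_count_table[v] = bits;
    }
}

/* Count the bits that are set in BOTH fingerprints, one byte at a time. */
static int count_common_bits(const unsigned char *a,
                             const unsigned char *b,
                             size_t nbytes)
{
    size_t i;
    int count = 0;

    assert(a != NULL && b != NULL);
    for (i = 0; i < nbytes; i++)
        count += bit_count_table[a[i] & b[i]];
    return count;
}
```

Looking up each ANDed byte in a 256-entry table avoids a per-bit loop in the inner comparison, which matters when every reagent fingerprint is compared against the query on each selection.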



int SelectEverything()
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;

    while (nProcessed < NoMorehitsPlease && SomeLeft )
    {
        /*
        ** What we would like to do is first select any selections that were found
        ** in a previous run.
        */
        if ( !InputSource || !( c_query = SelectFromInputFile(query)) )
        {
            if (! (c_query = SelectIt(query) ))
                return 0;
        }

        nProcessed++;
        SomeLeft--;

        /* then zap its neighbors and continue! */
        for (i=0;i<nY_01;i++)
            if (! IntersectQuery( iY_01+i, Y_01+i )) return 0;
        for (i=0;i<nY_02;i++)
            if (! IntersectQuery( iY_02+i, Y_02+i )) return 0;

        cqt = floor( (double) c_query / Tanimoto);
        q_lo = floor( (double) c_query * Tanimoto - (double) BitsInAbsentia);
        q_hi = ceil( (double) ( c_query + BitsInAbsentia) / Tanimoto);

        if ( DebugLevel )
            DumpValues(nY_01,nY_02);

        for (i=0;i<nY_01;i++)
        {
            if (cY_01[i] > cqt) { continue; }
            carhold = q_lo - cY_01[i];
            inthold = q_lo - iY_01[i];
            for (j=0;j<nY_02;j++)
            {
                if (UserAborted) return 1;
                if (cY_02[j] > cqt) { continue; }
                if (cY_02[j] < carhold) { continue; }
                if (inthold > iY_02[j]) { continue; }
                /*
                ** Do not need to look at it, if it has already been used, eliminated.
                */
                if (TestBit(Dead_Products,i*nY_02+j)) continue;

                ActuallyCompute( i, j, &onion, &intsc, &max);
                if (max >= Tanimoto)
                {
                    FlagProduct(Dead_Products, i, j, 0);
                    SomeLeft--;
                    if ( DebugLevel )
                    {
                        fprintf(stderr,"\nZapping %d %d",i+1,j+1);
                        DumpBitSet(Dead_Products,nY_01,nY_02);
                    }
                }
            } /* Y_02 loop */
        } /* Y_01 loop */
    } /* while still stuff left */
    return 1;
}

int TestBit(bitset, bit)
int *bitset, bit;
{
    int what, this;
    unsigned char *bytes;

    bytes = (unsigned char *) bitset;

    what = bit % 8;
    this = bit / 8;

    return (bytes[this] & setbits[what] );
}
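
The product bitsets addressed by TestBit and FlagProduct pack one bit per (Y_01, Y_02) pair in row-major order. A minimal sketch of that addressing scheme follows; the helper names bit_is_set, set_bit and product_bit are hypothetical and are not part of the listing.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper modeled on TestBit: returns 1 if the bit is set. */
static int bit_is_set(const unsigned char *bytes, int bit)
{
    assert(bit >= 0);
    return (bytes[bit / 8] >> (bit % 8)) & 1;
}

/* Hypothetical helper modeled on FlagProduct: sets one bit, leaving
   the other seven bits of the byte unchanged (note the OR). */
static void set_bit(unsigned char *bytes, int bit)
{
    assert(bit >= 0);
    bytes[bit / 8] |= (unsigned char)(1u << (bit % 8));
}

/* Row-major bit index of the product built from reagent `row` of the
   first list and reagent `col` of the second list (ncols = nY_02). */
static int product_bit(int row, int col, int ncols)
{
    return row * ncols + col;
}
```

With nY_01 * nY_02 products, the bitset needs (nY_01 * nY_02 + 7) / 8 bytes; the listing rounds up to whole 32-bit words instead.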
int FlagProduct(TheProducts, index1, index2, this)
int *TheProducts;
int index1, index2, this;
{
    int what;
    unsigned char *Products;

    /* if (DebugLevel)
        printf("%d %d, %d, %x\n",index1,index2,this,TheProducts);*/

    Products = (unsigned char *) TheProducts;

    if (!this ) this = index1*nY_02 + index2; /* bit index */
    what = this % 8;
    this /= 8;

    Products[this] |= setbits[what];
    return 1;
}

int DumpBitSet(bitSet,numY01,numY02)
int *bitSet ;
int numY01 ;
int numY02 ;
{
    int i , j ;
    unsigned char *Products = (unsigned char *)bitSet ;
    int pos ;
    int byte ;
    int bit ;
    int index1 ;
    int index2 ;



fprintf(stderr,"\n- 



-Y 02- 



"); 



5 "); 



10 
    for (i = 0 ; i < nY_02 ; i++ )
        fprintf(stderr," %3d ",i+1);
fiprintf(stderr,"\n 



An 



} 



    for ( i = 0 ; i < (numY01 * numY02) ; i++ )
    {
        index1 = i / numY02 ;
        index2 = i % numY02 ;
        byte = i / 8 ;
        bit = i % 8 ;

        if (( index2 == 0 ) )
            fprintf(stderr,"\n%3d | ",index1 + 1);
        fprintf(stderr," %3d ",(Products[byte] & setbits[bit])?1:0 );
    }

^ \n"); 

fprintf(stderr,"\n 



int DumpValues(numY01,numY02)
int numY01 ;
int numY02 ;
{
    int i , j ;
    int pos ;
    int byte ;
    int bit ;
    int index1 ;
    int index2 ;
    int onion ;
    int intsc ;
    double max ;



^rintf(stderr,"\n 

"); 

    for (i = 0 ; i < nY_02 ; i++ )
        fprintf(stderr," %5d ",i+1);



"); 



-Y 02 ^" 



\n 

fprintf(stderr,"\n 



    for ( i = 0 ; i < (numY01 * numY02) ; i++ )
    {
        index1 = i / numY02 ;
        index2 = i % numY02 ;

        ActuallyCompute( index1, index2, &onion, &intsc, &max);

        if (( index2 == 0 ) )
            fprintf(stderr,"\n%5d | ",index1 + 1);
        fprintf(stderr," %0.3f ",max);
    }

^ \n"); 

fprintf(stderr,"\n 



20 } 

int FlagReagent(TheReagent, size, index)
int *TheReagent;
int size, index;
{
    int what, this;
    unsigned char *Reagent;

    Reagent = (unsigned char *) TheReagent;

    what = index % 8;
    this = index / 8;
    Reagent[this] |= setbits[what];
    return 1;
}

int SelectFromInputFile(query)
int *query ;
{
    static int firstTime = 1 ;
    static FILE *fp = (FILE *)NULL ;

    unsigned char *p, *q ;
    int index1;
    int index2;
    int index ;
    char *line ;
    char *cp ;
    unsigned char *queryPtr ;

    if ( firstTime )
    {
        if ( !(fp = fopen(InputSource,"r")))
            goto UnableToOpenFile ;
        firstTime = 0 ;
    }

    if (-1 == UTL_SCAN_GETS( fp, "", "", &line)) return 0;

    if ( !( cp = strtok(line," ")) )
        goto UnableToParseLine ;
    if ( !( cp = strtok(NULL," ")) )
        goto UnableToParseLine ;
    index1 = atoi(cp) - 1 ;
    if ( !(cp = strtok(NULL," ")) )
        goto UnableToParseLine ;
    index2 = atoi(cp) - 1 ;

    if (( index1 < 0 ) || ( index2 < 0 ) )
        goto UnableToParseLine ;

    /*
    ** If we are reading back in a selection that might have already been filtered
    ** out we better adjust our counts.
    */
    if (TestBit(Dead_Products, index1*nY_02 + index2))
        SomeLeft++ ;

    p = (unsigned char *) Y_01[index1];
    q = (unsigned char *) Y_02[index2];
    c_query = 0;
    queryPtr = (unsigned char *)query ;
    for (index=0; index < BytesPerFingerPrint; index++, queryPtr++)
    {
        *queryPtr = *p++ | *q++ ;
        c_query += nbits[*queryPtr & 255];
    }

    OutputThisHit(index1,index2); /* both print it and note it in bitsets */

    return c_query;

UnableToParseLine :
    fprintf(stderr,"Unable to Parse %s\n",line);
    return 0 ;
UnableToOpenFile :
    fprintf(stderr, "Unable to open file %s\n",InputSource);
    return 0 ;
}

/* Here the intent is to select the next compound "intelligently".
   We try to maximize use of one or the other reagent.
*/
int SelectIt(query)
int *query;
{
    int i,j;

    if (What1 < 0) {GrabRandom( &i, &j, query); goto out; }
    switch (WhatFirst[0])
    {
    case '0':
        GrabRandom( &i, &j, query);
        break;
    case '1':
        GrabThis( &i, &j, 1, query);
        break;
    case '2':
        GrabThis( &i, &j, 2, query);
        break;
    }

out:
    OutputThisHit(i,j); /* both print it and note it in bitsets */
    return c_query;
}



int GrabThis( p1, p2, type, fp)
int *p1, *p2, type, *fp;
{
    unsigned char *p, *q, *pro;
    int index;

    switch (type)
    {
    case 1:
        if (!findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !findOne(Dead_Products, What2, nY_02, nY_01) &&
            !GrabRandom( p1, p2, fp) ) return 0;
        break;
    case 2:
        if (!findOne(Dead_Products, What2, nY_02, nY_01) &&
            !findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !GrabRandom( p1, p2, fp) ) return 0;
        break;
    }

    *p1 = What1; *p2 = What2;
    pro = (unsigned char *) fp;
    p = (unsigned char *) Y_01[What1];
    q = (unsigned char *) Y_02[What2];

    c_query = 0;
    for (index=0; index < BytesPerFingerPrint; index++, pro++)
    {   *pro = *p++ | *q++ ;
        c_query += nbits[*pro & 255]; }
    return 1;
}
/* This can be done more efficiently when we KNOW we are walking a vector */
int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    int i;

    for (i=0; i<max; i++, start += incr )
    {
        if ( TestBit(bitset, start)) continue;
        What1 = start / nY_02;
        What2 = start % nY_02;
        return 1;
    }
    return 0;
}



int GrabRandom( p1, p2, fp)
int *p1, *p2, *fp;
{
    int index, sum;
    int value1, value2;
    unsigned char *p, *q, *pro;

    p = (unsigned char *) Dead_Products;
    index = UTL_MATH_RAND() * SomeLeft + 1;
    value1 = sum = 0;

    while (sum < index)
    {   sum += nbits[ ~(*p++) & 255];
        value1 += 8; }

    p -= 1; sum -= nbits[ ~(*p) & 255 ]; value1 -= 9; value2 = (~(*p) & 255);
    while (sum < index)
    {   value1++;
        if ( value2 & 1) sum++;
        value2 = value2 >> 1; }

    value2 = value1 % nY_02;
    value1 /= nY_02 ;
    *p1 = What1 = value1;
    *p2 = What2 = value2;

    pro = (unsigned char *) fp;
    p = (unsigned char *) Y_01[What1];
    q = (unsigned char *) Y_02[What2];
    c_query = 0;
    for (index=0; index < BytesPerFingerPrint; index++, pro++)
    {   *pro = *p++ | *q++ ;
        c_query += nbits[*pro & 255]; }

    return 1;
}

int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i, product;
    unsigned char *h1, *h2, *hquery;

    /* if (DebugLevel)
        fprintf( stderr,"ActuallyCompute at %d , %d\n", index1, index2);*/

    h1 = (unsigned char *) Y_01[index1];
    h2 = (unsigned char *) Y_02[index2];
    hquery = (unsigned char *) query;
    *pUnion = *pIntersection = 0;
    for ( i=0; i<WordsPerFingerprint*4; i++)
    {
        product = *h1++ | *h2++ ;
        *pUnion += nbits[ product | *hquery];
        *pIntersection += nbits[ product & *hquery++];
    }
    /* if (DebugLevel > 9) fprintf(stderr,"%d / %d %6.3f\n",
        *pIntersection, *pUnion,
        (double) *pIntersection / *pUnion); */
    return (*pMaxTan = (double) (*pIntersection + BitsInAbsentia) / (double) *pUnion);
}
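
The Tanimoto estimate formed in ActuallyCompute can be shown as a standalone calculation. This is a sketch under stated assumptions: popcount8 and product_tanimoto are hypothetical names, the product fingerprint is taken to be the bitwise OR of the two reagent fingerprints as in the loop above, and absent_bits plays the role of BitsInAbsentia.

```c
#include <assert.h>
#include <stddef.h>

/* Number of 1 bits in the low eight bits of v. */
static int popcount8(unsigned v)
{
    int n = 0;
    v &= 0xFFu;
    while (v) { n += (int)(v & 1u); v >>= 1; }
    return n;
}

/* Tanimoto coefficient between a product fingerprint (r1 OR r2) and a
   query fingerprint: (shared bits + absent_bits) / (bits in either). */
static double product_tanimoto(const unsigned char *r1,
                               const unsigned char *r2,
                               const unsigned char *query,
                               size_t nbytes,
                               int absent_bits)
{
    size_t i;
    int inter = 0, uni = 0;

    assert(nbytes > 0);
    for (i = 0; i < nbytes; i++) {
        unsigned product = (unsigned)(r1[i] | r2[i]);
        uni   += popcount8(product | query[i]);
        inter += popcount8(product & query[i]);
    }
    if (uni == 0) return 0.0;   /* guard added in this sketch only */
    return (double)(inter + absent_bits) / (double)uni;
}
```

A product is zapped as a neighbor when this value reaches the Tanimoto threshold, so a larger absent_bits allowance makes the exclusion radius more conservative.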

int OutputThisHit( index1, index2)
int index1, index2;
{
    int which;

    which = index1*nY_02+index2;
    fprintf(OutputFile,"%s%d %d %d\n", PrefixForFiles, which+1,
            index1+1, index2+1);

    FlagProduct(Good_Products,0,0, which);
    FlagProduct(Dead_Products,0,0, which); /* can only be selected once */

    /* note use of reagents; this is slightly wasteful of time */
    FlagReagent(Good_1, nY_01, index1);
    FlagReagent(Good_2, nY_02, index2);

    return 1;
}
Appendix "J"

/* dbcsln_qstop */

/*+C
 *
 * This program evaluates topomeric shape similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups. A key assumption: D^2(i,j) = Dr1^2(i,j) + Dr2^2(i,j)
 * i.e. the distance between products from any one reaction is the
 * root mean square distance of their corresponding reactants.
 *
 * To be added: restart capability and reagent blackout.
 *
 * The input files, one per X1, X2, have one line per
 * structure and contain the element "tp = zzz;" where
 * the terminating ";" may also be ">".
 * The hex value of fp is the condensed representation of a CoMFA grid
 * value, 4 bits (one hex char) per grid, with interpretation as in
 * routine WhatsTheDifference().
 *
 * The resultant file contains one line per hit, of the form
 *      Y1 Y2 D D1 D2
 * where Y1 = index of the substituent in X1.prT file
 *       Y2 = index of the substituent in X2.prT file
 *       D = apparent Tanimoto similarity
 *       D1,D2 = R1, R2 distance
 *
 * dbcsln_stop -prefix <name> -distance <real> -prefer <what> -append
 *             -maxhits <int> -output <name> +debug
 *
 * Options:
 *
 * -prefix name     - name is the prefix for a set of 2 files
 *                    with extensions .X1.prT .X2.prT
 *                    ; files have fingerprints
 *                    (someday) will reload from prefix.RELOAD if present
 *
 * -distance dmin   - dmin is the closest allowed approach
 *                    (default is 80)
 *
 * -prefer          - one of R1,R2 else random. R1 maximizes use of R1
 *
 * -maxhits max     - stop when max hits are found (default infinity)
 *
 * -output filename - specifies the output file for the hit info
 *                    by default results are sent to stdout.
 *
 * -append          - append results to an existing output file
 *                    By default an output file is overwritten.
 *
 * +debug           - writes irrelevant info to stderr
 *                    This flag forces the display of all options
 */

/* use 3db
 * dbcc dbcslnquickselect.c -o dbcslnquickselect */
#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"

#define GoodExit 0
#define ErrorExit 1

#define Visual(s) { fprintf s; }



static FILE *OutputFile;
static char *OutputFileName;
static char *WhatFirst;
static int What1 = -1;
static int What2;
static char *PrefixForFiles;
static char *InputSource = 0;
static FILE *InputSourceFile;

/* Code presumes that an int is 32 bits, ASCII-ed into %.8x */

static unsigned char **Y_01;   /* fingerprints */
static unsigned char **Y_02;   /* " */
static int nY_01;              /* number of structures */
static int nY_02;              /* " */
static double *iY_01;          /* intersection count of fprints */
static double *iY_02;          /* " */
static int *Good_1;
static int *Good_2;
static int *Dead_1;
static int *Dead_2;
static int *Good_Products;
static int *Dead_Products;
static double boundary[16];
static double Dist[16][16];
static int setbits[8];
static int nbits[256];
static double Distance = 80.0 ;
static int AppendToOutputFile = 0;
static int BytesPerFingerPrint[2] ;
static int NoMorehitsPlease = 999999999;
static int DebugLevel;
static int UserAborted;
static int nProcessed = 0;
static int SomeLeft;
static char next8[10] = "01\0";



static struct ParseOptions Options[] = {
/*
** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
*/
    {"prefix", ParseOptString, &PrefixForFiles,
        "Prefix for all input files" },
    {"distance", ParseOptDouble, &Distance,
        "Topomer distance (typically 75 to 100)" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"input", ParseOptString, &InputSource,
        "File from which queries will be read (default stdin). "},
    {"output", ParseOptString, &OutputFileName,
        "File to which hit info will be written. "},
    {"prefer", ParseOptString, &WhatFirst,
        "One of R1, R2 to maximize use of."},
    {"append", ParseOptNoArg, &AppendToOutputFile,
        "Use -append to append results to an existing file" },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
};



int UBS_OUTPUT_MESSAGE() { return 0; } /* just for compiling OK */
int UIMS2_WRITE_PHOTO() { return 0; }

int lowercase (s) char *s; {while (*s) { if (isupper(*s)) *s = tolower(*s); s++;}}
static void UserHitControlC()
/*+I
 *
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 */
{
    UserAborted = 1;
}



static int ParseArguments( argc, argv )
/*+I
 *
 * This function parses the command line arguments.
 * Returns: 1 on a successful command line parse, 0 otherwise.
 * Warnings:
 * Errors:
 * See Also:
 *
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    OutputFile = stdout;

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;

    if (WhatFirst)
    {   if (strstr(WhatFirst,"R1")) WhatFirst[0] = '1';
        if (strstr(WhatFirst,"R2")) WhatFirst[0] = '2';
    } else {
        WhatFirst = UTL_MEM_ALLOC(2); WhatFirst[0] = '0'; }
    return 1;

SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*+I
 * Returns: 1 on success, else 0
 */
{
    char *msg;
    FILE *fp;

    OutputFile = stdout;
    if( OutputFileName )
    {
        /* We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) the file to output to exists
            **     Use the ownership of the current owner of the file or if you cant do that
            **     do not do anything.
            ** (2) The file is being created,
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            {   /* If the file exist and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            {   /* Create the file as the REAL user */
                seteuid(uid);
            }
        }

        OutputFile = fopen( OutputFileName, (AppendToOutputFile?"a":"wb"));
        if( !OutputFile ) {
            fprintf(stderr, "Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }

    return 1;

ErrorReturn:
    return 0;
}



static void CloseOutputFile()
/*+I
 * This function closes the output file. It is included just for cleanliness.
 */
{
    fclose( OutputFile );
}



int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;

    /*
    *** Establish handler for a user interrupt.
    */
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif

    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;

    if( !OpenOutputFile() ) goto FailureExit;
    /* if (!RestartState()) goto FailureExit; */
    time( &startTime );
    Visual((stderr, "Begin reading files: %s",ctime(&startTime)));

    /* Let's actually do something now */
    if (!ReadEverything()) goto FailureExit;

    time( &finishTime );
    Visual((stderr, "Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted && !SelectEverything()) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );

    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;

    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime % 60)));

Visual((stderr,"Each comparison required %.8f seconds to 
            (totalTime/((double)(nProcessed?nProcessed:1)))));



    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));
    UserAborted ? exit(ErrorExit) : exit(GoodExit);

SyntaxError:
    exit(1);

FailureExit:
    exit(ErrorExit);
}

int ReadEverything()
{
    char *hold;
    int i;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */

    if (!PrefixForFiles ) return 0;
    if (!WhatsTheDifference()) return 0;

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X1.prT"))) return 0;
    if (! (InputSourceFile = fopen(hold,"r"))) return 0;
    if (! (nY_01 = CountLines())) return 0;
    if (! (Y_01 = (unsigned char **)
            UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_01))) return 0;
    if (!(iY_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01))) return 0;
    for (i=0;i<nY_01;i++)
        if (! GetNextLine( Y_01+i, 0 )) return 0;
    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!(hold = UTL_STR_CONCATENATE(PrefixForFiles,".X2.prT"))) return 0;
    if (! (InputSourceFile = fopen(hold,"r"))) return 0;
    if (! (nY_02 = CountLines())) return 0;
    if (! (Y_02 = (unsigned char **)
            UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_02))) return 0;
    if (!(iY_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02))) return 0;
    for (i=0;i<nY_02;i++)
        if (! GetNextLine( Y_02+i, 1 )) return 0;
    fclose(InputSourceFile); UTL_MEM_FREE(hold);

    if (!Good_1) /* not reloaded */
    {   i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);
        i = (nY_02+31)/32 * 4;
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);
        i = (nY_01*nY_02+31)/32 * 4;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Good_Products,0,i);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(i))) return 0;
        memset( Dead_Products,0,i);
        SomeLeft = nY_01 * nY_02;
    }
    return 1;
}

int WhatsTheDifference()
{
    int i, j;

#define pow2(a) ( (a) * (a) )
    /* the assignment of codes is based on the following (from gen_jls.c):
       static fpt cutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                                14., 16., 18., 20., 22., 24., 26., 30. };
    */
    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1 ;
    for (i=2;i<15;i++)
        boundary[i] = 2*i-3;
    boundary[15] = 30.0; /* this is a steep curve with a cutoff at 30! */
    for (i=0;i<16;i++) for (j=0;j<16;j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j]);

    for (i=0;i<256;i++) nbits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
        (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128 ;
    for (i=0;i<8;i++) setbits[i] = ( 1 << i) & 255;

    Distance *= Distance; /* want to test D^2 directly */

    return 1;
}

int CountLines()
{
    int i;
    char *foo;

    i = 0;
    while ( -1 != UTL_SCAN_GETS( InputSourceFile, "\\", &foo)) i++;

    rewind(InputSourceFile);
    return i;
}
int GetNextLine( pFP, index)
unsigned char **pFP;
int index;
{
    char *line;
    int words, hold;

    if (-1 == UTL_SCAN_GETS( InputSourceFile, "\\", &line)) return 0;
    line = strstr(line,"tp=")+strlen("tp=");
    UTL_SCAN_TOKENIZE(line, ';', '\\') ;
    UTL_SCAN_TOKENIZE(line, '>', '\\');

    words = strlen(line) / 2; /* must have 8 bit bytes */
    if (!BytesPerFingerPrint[index])
    {   BytesPerFingerPrint[index] = words;
    }
    if ( words != BytesPerFingerPrint[index]) return 0;
    *pFP = (unsigned char *) UTL_MEM_ALLOC(words);
    for (words=0; words<BytesPerFingerPrint[index]; words++)
    {
        memcpy(next8,line,2);
        line += 2;
        sscanf(next8,"%2x", &hold);
        *(*pFP+words) = (unsigned char) hold;
    }
    return 1;
}

int IntersectQuery( pIntr, pFP, query, index)
double *pIntr;
unsigned char **pFP, **query;
int index;
{
    unsigned char *ptr, *qtr;
    int i;
    double count;

    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) *query;
    for (count=0.0, i=0; i<BytesPerFingerPrint[index]; i++, ptr++, qtr++)
        count += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
               + Dist[ (*ptr & 0xF0) >> 4 ][ (*qtr & 0xF0) >> 4 ] ;

    *pIntr = count;
    return 1;
}
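
The squared topomer distance accumulated above works on 4-bit field codes, two per byte. The sketch below reproduces the idea with hypothetical names (code_value, sq_dist, init_sq_dist, reagent_sq_distance); the code-to-value mapping follows the boundary[] assignments in WhatsTheDifference, and everything else is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

static double code_value[16];   /* field value represented by each 4-bit code */
static double sq_dist[16][16];  /* squared difference for every code pair */

/* Build the 16 x 16 table of squared differences once. */
static void init_sq_dist(void)
{
    int i, j;

    code_value[0] = 9999.0;               /* missing data */
    code_value[1] = -0.1;
    for (i = 2; i < 15; i++)
        code_value[i] = 2 * i - 3;
    code_value[15] = 30.0;                /* cutoff */
    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++) {
            double d = code_value[i] - code_value[j];
            sq_dist[i][j] = d * d;
        }
}

/* Sum of squared differences over all grid points of two reagents;
   each byte carries two grid points (low and high nibble). */
static double reagent_sq_distance(const unsigned char *a,
                                  const unsigned char *b,
                                  size_t nbytes)
{
    size_t i;
    double sum = 0.0;

    assert(a != NULL && b != NULL);
    for (i = 0; i < nbytes; i++)
        sum += sq_dist[a[i] & 0x0F][b[i] & 0x0F]
             + sq_dist[a[i] >> 4][b[i] >> 4];
    return sum;
}
```

Under the additivity assumption stated in the program header, the squared distance between two products is the sum of this quantity for the first reagent pair and for the second, which is why the selection loop compares iY_01[i] + iY_02[j] against the squared Distance.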



int SelectEverything()
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;

    while (nProcessed < NoMorehitsPlease && SomeLeft )
    {
        if (! SelectIt() ) return 0;
        nProcessed++;
        SomeLeft--;

        /* then zap its neighbors and continue! */
        for (i=0;i<nY_01;i++)
            if (! IntersectQuery( iY_01+i, Y_01+i, Y_01+What1, 0 )) return 0;
        for (i=0;i<nY_02;i++)
            if (! IntersectQuery( iY_02+i, Y_02+i, Y_02+What2, 1 )) return 0;

        for (i=0;i<nY_01;i++)
        {
            if (iY_01[i] > Distance) { continue; }
            for (j=0;j<nY_02;j++)
            {
                if (UserAborted) return 1;
                if (iY_02[j] > Distance) { continue; }
                if ( iY_01[i] + iY_02[j] <= Distance &&
                     ! TestBit(Dead_Products,i*nY_02+j) )
                {
                    FlagProduct(Dead_Products, i, j, 0);
                    SomeLeft--;
                }
            } /* Y_02 loop */
        } /* Y_01 loop */
    } /* while still stuff left */
    return 1;
}

int TestBit(bitset, bit)
int *bitset, bit;
{
    int what, this;
    unsigned char *bytes;

    bytes = (unsigned char *) bitset;

    what = bit % 8;
    this = bit / 8;

    return (bytes[this] & setbits[what] );
}
int FlagProduct(TheProducts, index1, index2, this)
int *TheProducts;
int index1, index2, this;
{
    int what;
    unsigned char *Products;

    /* if (DebugLevel)
        printf("%d %d, %d, %x\n",index1,index2,this,TheProducts);*/

    Products = (unsigned char *) TheProducts;

    if (!this ) this = index1*nY_02 + index2; /* bit index */
    what = this % 8;
    this /= 8;

    Products[this] |= setbits[what];
    return 1;
}

int FlagReagent(TheReagent, size, index)
int *TheReagent;
int size, index;
{
    int what, this;
    unsigned char *Reagent;

    Reagent = (unsigned char *) TheReagent;

    what = index % 8;
    this = index / 8;
    Reagent[this] |= setbits[what] ;
    return 1;
}



/* Here the intent is to select the next compound "intelligently".
   We try to maximize use of one or the other reagent.
*/
int SelectIt()
{
    int i,j;

    if (What1 < 0) {GrabRandom( &i, &j); goto out; }
    switch (WhatFirst[0])
    {
    case '0':
        GrabRandom( &i, &j);
        break;
    case '1':
        GrabThis( &i, &j, 1);
        break;
    case '2':
        GrabThis( &i, &j, 2);
        break;
    }

out:
    OutputThisHit(i,j); /* both print it and note it in bitsets */
    return 1;
}

int GrabThis( p1, p2, type)
int *p1, *p2, type;
{
    unsigned char *p, *q, *pro;
    int index;

    switch (type)
    {
    case 1:
        if (!findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !findOne(Dead_Products, What2, nY_02, nY_01) &&
            !GrabRandom(p1, p2) ) return 0;
        break;
    case 2:
        if (!findOne(Dead_Products, What2, nY_02, nY_01) &&
            !findOne(Dead_Products, What1*nY_02, 1, nY_02) &&
            !GrabRandom(p1, p2) ) return 0;
        break;
    }

    *p1 = What1;
    *p2 = What2;

    return 1;
}



/* This can be done more efficiently when we KNOW we are walking a vector */
int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    int i;

    for (i=0; i<max; i++, start += incr )
    {
        if ( TestBit(bitset, start)) continue;
        What1 = start / nY_02;
        What2 = start % nY_02;
        return 1;
    }
    return 0;
}

#else 

int findOne(bitset,start,incr,max)
int *bitset, start, incr, max;
{
    static int oldstart = -1234,
               oldincr,
               old_i;
    int i;

    if ( (start != oldstart) || (incr != oldincr) )
        oldstart = start; oldincr = incr;
    old_i++;
    start += incr * old_i;
    for (i=old_i; i<max; i++, start += incr)
    {
        if ( TestBit(bitset, start)) continue;
        What1 = start / nY_02;
        What2 = start % nY_02;
        old_i = i;
        return 1;
    }
    oldstart = -1234;
    return 0;
}

25 #endif 

int GrabRandom( p1, p2, fp)
int *p1, *p2, *fp;
{
    int index, sum;
    int value1, value2;
    unsigned char *p, *q, *pro;

    p = (unsigned char *) Dead_Products;
    index = UTL_MATH_RAND() * SomeLeft + 1;
    value1 = sum = 0;

    while (sum < index)
    {   sum += nbits[ ~(*p++) & 255];
        value1 += 8; }

    p -= 1; sum -= nbits[ ~(*p) & 255 ]; value1 -= 9; value2 = (~(*p) & 255);
    while (sum < index)
    {   value1++;
        if ( value2 & 1) sum++;
        value2 = value2 >> 1; }

    value2 = value1 % nY_02;
    value1 /= nY_02 ;
    *p1 = What1 = value1;
    *p2 = What2 = value2;

    return 1;
}
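
GrabRandom picks the index-th product that is still alive, i.e. the index-th zero bit of Dead_Products, and then converts that bit position into a reagent pair. The sketch below shows the same selection with a plain bit-by-bit scan; kth_clear_bit is a hypothetical name, and the listing's version is faster because it skips whole bytes using the nbits[] table before scanning the final byte.

```c
#include <assert.h>
#include <stddef.h>

/* Return the position of the k-th (1-based) clear bit among the first
   nbits bits of the bitset, or -1 if fewer than k bits are clear. */
static int kth_clear_bit(const unsigned char *bytes, int nbits, int k)
{
    int bit, seen = 0;

    assert(bytes != NULL && k >= 1);
    for (bit = 0; bit < nbits; bit++) {
        int is_set = (bytes[bit / 8] >> (bit % 8)) & 1;
        if (!is_set && ++seen == k)
            return bit;
    }
    return -1;
}
```

Given the returned position pos and ncols products per row (nY_02 in the listing), the reagent indices are pos / ncols and pos % ncols, matching the value1 and value2 arithmetic at the end of GrabRandom.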



int OutputThisHit( index1, index2)
int index1, index2;
{
    int which;

    which = index1*nY_02+index2;
    fprintf(OutputFile,"%s%d %d %d\n", PrefixForFiles, which+1,
            index1+1, index2+1);
    FlagProduct(Good_Products,0,0, which);
    FlagProduct(Dead_Products,0,0, which); /* can only be selected once */

    /* note use of reagents; this is slightly wasteful of time */
    FlagReagent(Good_1, nY_01, index1);
    FlagReagent(Good_2, nY_02, index2);

    if (DebugLevel) printf(" Selection %d is %d , %d\n",
                           nProcessed+1, index1+1, index2+1);

    return 1;
}
Appendix "K"

/* dbcsln_design
*/

/*+C
 *
 * This program evaluates (approximate) Topomer+Tanimoto similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups. Diversity is achieved by zapping all neighbors after each
 * new selection, so that any non-zapped product can freely be selected.
 *
 * To be added: restart capability and reagent blackout.
 *     (i.e. to recomplete an earlier design and/or to remove
 *     all occurrences of Y_01 = 37 and so on when they
 *     prove to be unavailable or otherwise unsuitable).
 * Limitations: currently exactly 2 R groups are assumed. Need to extend
 *     to more than 2 and to handle X groups.
 *
 * The resultant file contains one line per hit, of the form
 *      Y1 Y2
 * where Y1 = index of the substituent in X1.pro file
 *       Y2 = index of the substituent in X2.pro file
 *
 * Options: Look at the array Options below.
 *
 */

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "commonData.h" /* Globals used by most functions, we will clean this
                           up soon */
#include "dbcsln_bs_proto.h"
#include "dbcsln_hlm_proto.h"
#define OBSOLETE_IS_OK 1
FILE *debugFile = (FILE *) NULL ;



#ifdef OBSOLETE_IS_OK
/* these sections retain the filtering capabilities now also present
   in db_filter.c -- at some point they should exist ONLY in dbj
*/
static struct RangeInfoStruct RangeValuesData ;
static struct OneOfInfoStruct OneOfValuesData ;
static struct InputInfoStruct InputData ;
static int NumRangeFields ;
static int NumRangeFieldsAllocated ;
static RangeStruct *RangeFields ;
static int NumOneOfFieldsAllocated ;
static int NumOneOfFields ;
static OneOfStruct *OneOfValues ;
static float **RangeValues_Y01 ; /* Actual values read in from nnn.X1 file,
                                    If MW is the first and logp is the second value
                                    specified on the -rangevar argument list then
                                    RangeValues_Y01[n][0] would keep the value for MW
                                    for the nth line in the nnn.X1 file and
                                    RangeValues_Y01[n][1] would keep the value for
                                    logp for that line */
static float **RangeValues_Y02 ; /* same */
static int **OneOfValues_Y01 ;   /* Actual values read from nnn.X1 files but translated
                                    into an index of OneOfValues[i].values so
                                    we dont have to waste memory and time doing strcmp */
static int **OneOfValues_Y02 ;   /* Same */
#endif

static char *MasterFile ;
static char *MasterFileList ;
static char *BitsetFileList ;
static char *MasterRecord ;
static FILE *MasterFile_File;
static char *FngrFile;
static int FingerCore_Card;
static int *FingerCore_FP;
static char *RangeVar ;
static char *OneOfVar ;
static double Tanimoto = 0.85;
static int WordsPerFingerprint = 0;
static int BytesPerFingerPrint = 0;
static int NoMorehitsPlease = 999999999;
static int DebugLevel;
static int UserAborted;
static char *OutputFileName;
static char *CheckPointFileName;
static char *WhatFirst;
static char *InputSource = 0;
static char *BitsetSource = 0;
static char *DatabaseNames = (char *)0 ;
static char *HitlistNames = (char *)0 ;
static int BitOffsets[MAX_INPUT_CSLNS]; /* why recompute? */
int TotalProducts ;
static int Pro_size;
static struct ParseOptions Options[] = {
/*
** DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
*/
    {"master", ParseOptString, &MasterFile,
        "Name is the file with master file records" },
    {"masterlist", ParseOptString, &MasterFileList,
        "Name of the file containing master input/result file names" },
    {"bitsetlist", ParseOptString, &BitsetFileList,
        "Name of the file containing bitset input/result file names" },
    {"index", ParseOptString, &MasterRecord,
        "Which MasterRecord or Bitset entry 1-n" },
    {"Tanimoto", ParseOptDouble, &Tanimoto,
        "Similarity threshold (0.0 to 1.0)" },
    {"distance", ParseOptDouble, &Distance,
        "Topomer distance (typically 75 to 100)" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"bitset", ParseOptString, &BitsetSource,
        "Bitset file to start from"},
    {"output", ParseOptString, &OutputFileName,
        "File to which hit info will be written. "},
    {"checkpoint", ParseOptString, &CheckPointFileName,
        "File to which bitset info will be written. "},
    {"prefer", ParseOptString, &WhatFirst,
        "One of R1, R2 to maximize use of."},
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
#ifdef OBSOLETE_IS_OK
    {"rangevar", ParseOptString, &RangeVar,
        "Scalar field name and range to filter out, i.e. logp -1.0 8.0 MW 200 500 price 0 12.50" },
    {"oneof", ParseOptString, &OneOfVar,
        "Field name and list of values that the product should match\n, i.e. supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" },
#endif
    {"database", ParseOptString, &DatabaseNames,
        "Unity database to use to exclude possible products" },
    {"hitlist", ParseOptString, &HitlistNames,
        "Unity hitlist to use to exclude possible products" },
};
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30 



static int WarmUp()
{
    int i;

    for (i=0;i<65536;i++) BigBits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
            (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128
            + (i&256)/256 + (i&512)/512 + (i&1024)/1024
            + (i&2048)/2048
            + (i&4096)/4096 + (i&8192)/8192 + (i&16384)/16384
            + (i&32768)/32768 ;
    setbits_nbits_Init();
    return 1;
}
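
The table built by WarmUp maps every 16-bit value to its number of set bits, so that later loops can count bits one 16-bit word at a time. A minimal sketch of the same idea follows. The names bit_count16, init_bit_count16 and count_bits are illustrative and do not appear in the listing, and the recurrence used to fill the table is a substitute for the sixteen-term sum above, not the original code.

```c
#include <stddef.h>

/* Lookup table: bit_count16[v] holds the number of set bits in the 16-bit value v. */
static unsigned char bit_count16[65536];

/* Fill the table once. Each entry reuses the entry for v shifted right by one
   bit and adds the low bit of v. */
static void init_bit_count16(void)
{
    unsigned int v;

    bit_count16[0] = 0;
    for (v = 1; v < 65536u; v++)
        bit_count16[v] = (unsigned char)(bit_count16[v >> 1] + (v & 1u));
}

/* Count the set bits in an array of 16-bit words with one table lookup per word. */
static int count_bits(const unsigned short *words, size_t nwords)
{
    int total = 0;
    size_t i;

    for (i = 0; i < nwords; i++)
        total += bit_count16[words[i] & 0xFFFFu];
    return total;
}
```

After init_bit_count16() has run once, count_bits can be called on any fingerprint viewed as an array of 16-bit words.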

static int WhatsTheDifference()
{
    int i, j;

#define pow2(a) ( (a) * (a) )

    /* the assignment of codes is based on the following (from gen_pls.c):
       staticfptcutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                              14., 16., 18., 20., 22., 24., 26., 30. };
    */
    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1 ;
    for(i=2;i<15;i++)
        boundary[i] = 2*i-3;
    boundary[15] = 30.0; /* this is a steep curve with a cutoff at 30! */

    for (i=0;i<16;i++) for (j=0;j<16;j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j]);
    Distance *= Distance; /* want to test D^2 directly */
    return 1;
}
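
WhatsTheDifference precomputes squared differences between sixteen bin boundaries, and IntersectQuery later sums those values over the two 4-bit codes packed into each byte. The sketch below illustrates that pairing under stated assumptions. The boundary values are copied from the listing, but the names bin_boundary, bin_dist2, init_bin_distances and packed_distance2 are illustrative only.

```c
#include <stddef.h>

static double bin_boundary[16];
static double bin_dist2[16][16];

/* Build the 16 x 16 table of squared differences between bin boundaries. */
static void init_bin_distances(void)
{
    int i, j;

    bin_boundary[0] = 9999.0;              /* code 0 marks missing data */
    bin_boundary[1] = -0.1;
    for (i = 2; i < 15; i++)
        bin_boundary[i] = 2.0 * i - 3.0;
    bin_boundary[15] = 30.0;

    for (i = 0; i < 16; i++)
        for (j = 0; j < 16; j++) {
            double d = bin_boundary[i] - bin_boundary[j];
            bin_dist2[i][j] = d * d;
        }
}

/* Sum the squared distance over both 4-bit codes stored in each byte. */
static double packed_distance2(const unsigned char *a, const unsigned char *b,
                               size_t nbytes)
{
    double sum = 0.0;
    size_t i;

    for (i = 0; i < nbytes; i++)
        sum += bin_dist2[a[i] & 0x0F][b[i] & 0x0F]
             + bin_dist2[(a[i] & 0xF0) >> 4][(b[i] & 0xF0) >> 4];
    return sum;
}
```

Because the table stores squared values, a caller can compare the sum directly against a squared threshold, which is presumably why the listing squares Distance once at start-up.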

static int CalcualteProductFingurePrint(product,firstPart,secondPart)
int *product ;
int *firstPart ;
int *secondPart ;
{
    int index ;
    int totalBitsSet = 0 ;
    unsigned char *prod , *y01, *y02 ;

    prod = ( unsigned char *)product ;
    y01 = ( unsigned char *)firstPart ;
    y02 = ( unsigned char *)secondPart ;
    for (index=0;index < BytesPerFingerPrint; index++,prod++)
    {
        *prod = *y01++ | *y02++ ;
        totalBitsSet += nbits[*prod & 255];
    }
    return totalBitsSet ;
}

static int IntersectQuery( pIntr, pFP, pXntr, pXP, xuery, index)
int *pIntr, **pFP;
double *pXntr;
unsigned char **pXP, **xuery;
int index;
{
    unsigned char *ptr, *qtr;
    int i, count;
    double xount;

    if (!(*pFP) || !(*pXP))
        return 1 ;
    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for(count=0, i=0; i<WordsPerFingerprint*4;i++)
        count += nbits[ *ptr++ & *qtr++];
    *pIntr = count;
    if ( xuery )
    {
        ptr = (unsigned char *) *pXP;
        qtr = (unsigned char *) *xuery;
        for(xount=0.0, i=0; i<XytesPerFingerPrint[index];i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                   + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
        *pXntr = xount;
    }
    return 1;
}

int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan, currentInput)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i;
    unsigned short *h1, *h2, *hquery, product;
    int numberOfMissingBits ;

    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0] ;
    else
        numberOfMissingBits = NumMissingBits[currentInput] ;
    h1 = (unsigned short *) Y_01[index1];
    h2 = (unsigned short *) Y_02[index2];
    hquery = (unsigned short *) query;
    *pUnion = *pIntersection = 0;
    for(i=0; i<WordsPerFingerprint*2; i++,h1++,h2++,hquery++)
    {
        /* product = (*h1 | *h2) ;*/
        *pUnion += BigBits[ (*h1 | *h2) | *hquery];
        *pIntersection += BigBits[ (*h1 | *h2) & *hquery];
    }
    *pMaxTan = (double) (*pIntersection + numberOfMissingBits )/ (double) *pUnion;
    return 1;
}
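
ActuallyCompute treats the bitwise OR of two reagent fingerprints as the product fingerprint and scores it against the query as (intersection + missing bits) / union. The following is a minimal, self-contained sketch of that calculation. The names popcount16 and product_tanimoto are illustrative, the bit counting uses a loop instead of the BigBits table, and the guard for an empty union is an addition that the listing does not contain.

```c
#include <stddef.h>

/* Count the set bits in the low 16 bits of v by clearing the lowest set bit
   on each pass. */
static int popcount16(unsigned int v)
{
    int n = 0;

    for (v &= 0xFFFFu; v != 0; v &= v - 1)
        n++;
    return n;
}

/* Tanimoto similarity between the product fingerprint (r1 OR r2) and the query.
   missing_bits is credited to the intersection, as in the listing.
   Returns 0.0 when the union is empty (an assumption of this sketch). */
static double product_tanimoto(const unsigned short *r1, const unsigned short *r2,
                               const unsigned short *query, size_t nwords,
                               int missing_bits)
{
    int uni = 0, inter = 0;
    size_t i;

    for (i = 0; i < nwords; i++) {
        unsigned int product = (unsigned int)r1[i] | (unsigned int)r2[i];
        uni   += popcount16(product | query[i]);
        inter += popcount16(product & query[i]);
    }
    if (uni == 0)
        return 0.0;
    return (double)(inter + missing_bits) / (double)uni;
}
```

A product whose score reaches the Tanimoto threshold is a neighbor of the query and would be flagged as dead by the caller.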

static int
ZapAllNeighbors(thisQuery,thisC_Query,numZapped,doCTOPS,index1,index2,currentInput)
int *thisQuery ;
int thisC_Query ;
int *numZapped ;
int doCTOPS ;
int currentInput ;
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;
    int k ;
    int Y_01_Offset, Y_02_Offset ;
    int pos ;
    int numberOfMissingBits ;

    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0] ;
    else
        numberOfMissingBits = NumMissingBits[currentInput] ;
    if ( thisQuery )
    {
        memcpy(query,thisQuery,BytesPerFingerPrint) ;
        c_query = thisC_Query ;
    }
    *numZapped = 0 ;
    Y_01_Offset = Y_02_Offset = 0 ;
    for ( k = 0 ; k < CurrentInput ; k++ )
    {
        Y_01_Offset += Y_01_Length[k];
        Y_02_Offset += Y_02_Length[k];
    }

    for (i=0;i<nY_01;i++)
        if (! IntersectQuery( iY_01+i,
                              Y_01+i,
                              iX_01+i,
                              X_01+i,
                              (doCTOPS)?X_01 + index1 + Y_01_Offset : NULL ,
                              0))
            return 0;
    for (i=0;i<nY_02;i++)
        if (! IntersectQuery( iY_02+i,
                              Y_02+i,
                              iX_02+i,
                              X_02+i,
                              (doCTOPS)?X_02 + index2 + Y_02_Offset : NULL ,
                              1 ))
            return 0;
    /* now zap topomer neighbors */

    /*
    ** Only do topomer neighbors if CTOPS was present in the input.
    */
    if ( doCTOPS )
    {
        Y_01_Offset = Y_02_Offset = 0 ;
        for ( k = 0 ; k < TotalInputs ; k++ )
        {
            for(i= 0 ;i< Y_01_Length[k];i++)
            {
                if (iX_01[ i + Y_01_Offset ] > Distance)
                    continue;
                for (j=0 ;j< Y_02_Length[k];j++)
                {
                    if (UserAborted)
                        return 1;
                    if (iX_02[j+Y_02_Offset] > Distance)
                        continue;
                    if (iX_01[i + Y_01_Offset] + iX_02[j + Y_02_Offset] < Distance &&
                        ! TestBit(Dead_Products ,
                                  BitMapStartPoint[k] + i*Y_02_Length[k] + j) )
                    {
                        if (DebugLevel == 69)
                            printf("Distance kill %d %d - %f , %f + %f\n",
                                   i+1,j+1, iX_01[i] + iX_02[j], iX_01[i] , iX_02[j]);
                        pos = BitMapStartPoint[k] + i*Y_02_Length[k] + j ;
                        FlagProduct(Dead_Products, 0,0, pos );
                        SomeLeft-- ;
                        RemainingInput[k]-- ;
                        (*numZapped)++ ;
                    }
                } /* Y_02 loop */
            } /* Y_01 loop */
            Y_01_Offset += Y_01_Length[k] ;
            Y_02_Offset += Y_02_Length[k] ;
        }
    }

    cqt = floor( (double) c_query / Tanimoto);
    q_lo = floor( (double) c_query * Tanimoto - (double) numberOfMissingBits );
    q_hi = ceil( (double) ( c_query + numberOfMissingBits )/ Tanimoto);
    inTestBit = inActually = 0;
    Y_01_Offset = Y_02_Offset = 0 ;
    /*
    ** Run thru all the input files, one at a time.
    */
    for ( k = 0 ; k < TotalInputs ; k++ )
    {
        for(i= 0 ;i< Y_01_Length[k];i++)
        {
            if (cY_01[i + Y_01_Offset] > cqt) { continue; }
            carhold = q_lo - cY_01[i + Y_01_Offset];
            inthold = q_lo - iY_01[i+Y_01_Offset];
            for (j=0 ;j< Y_02_Length[k];j++)
            {
                if (UserAborted) return 1;
                if (cY_02[j+Y_02_Offset] > cqt) { continue; }
                if (cY_02[j+Y_02_Offset] < carhold) { continue; }
                if (inthold > iY_02[j+Y_02_Offset]) { continue; }
#ifdef Waste_time
                time( &waste_time );
#endif
                if (TestBit(Dead_Products,BitMapStartPoint[k] + i*Y_02_Length[k] + j))
                    continue;
#ifdef Waste_time
                inTestBit += time( &trash_time ) - waste_time;
#endif
                ActuallyCompute( Y_01_Offset + i, Y_02_Offset + j, &onion, &intsc, &max,k);
#ifdef Waste_time
                inActually += time( &waste_time ) - trash_time;
#endif
                if (max >= Tanimoto)
                {
                    if (DebugLevel == 69)
                        printf("Tanimoto kill %d %d - %6.3f , %d + %d, %d + %d\n",
                               i+1,j+1, max, cY_01[i],cY_02[j], iY_01[i],iY_02[j]);
                    pos = BitMapStartPoint[k] + i*Y_02_Length[k] + j ;
                    FlagProduct(Dead_Products, 0,0, pos );
                    SomeLeft-- ;
                    RemainingInput[k]-- ;
                    (*numZapped)++ ;
                    if ( DebugLevel )
                        fprintf(stderr,"\nZapping %d %d %d",k+1,i+1,j+1);
                }
            } /* Y_02 loop */
        } /* Y_01 loop */
        Y_01_Offset += Y_01_Length[k] ;
        Y_02_Offset += Y_02_Length[k] ;
    } /* Number of Inputfile loop */
#ifdef Waste_time
    Visual((stderr,"ActuallyCompute : %d",inActually));
    Visual((stderr," TestBit : %d",inTestBit ));
#endif
    if ( DebugLevel )
    {
        int fred ;
        for ( fred = 0 ; fred < TotalInputs ; fred++ )
            DumpValues(fred,Y_01_Length[fred],Y_02_Length[fred],ActuallyCompute) ;
    }
}
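
Before calling ActuallyCompute, ZapAllNeighbors discards reagent pairs using bit counts alone (the cqt, carhold and inthold tests). The principle can be sketched as follows: the intersection of two fingerprints can hold no more bits than the smaller one, and the union no fewer than the larger one, so min/max of the two counts is an upper bound on the Tanimoto coefficient. The function names below are illustrative, and the sketch deliberately omits the missing-bit corrections that the listing applies.

```c
/* Upper bound on the Tanimoto coefficient of two fingerprints, derived from
   their bit counts only: intersection <= min(a, b) and union >= max(a, b). */
static double tanimoto_upper_bound(int count_a, int count_b)
{
    int lo = count_a < count_b ? count_a : count_b;
    int hi = count_a < count_b ? count_b : count_a;

    if (hi == 0)
        return 0.0;
    return (double)lo / (double)hi;
}

/* Returns 1 when the pair cannot reach the threshold, so that the full
   bit-by-bit comparison can be skipped; returns 0 otherwise. */
static int can_skip_pair(int count_a, int count_b, double threshold)
{
    return tanimoto_upper_bound(count_a, count_b) < threshold;
}
```

With a threshold of 0.85, a candidate with 50 set bits can never be a neighbor of a query with 100 set bits, while a candidate with 90 set bits still has to be compared in full.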



/*
**
** Abstract         : Function zaps products that are missing CTOPS or FP fields.
**
** Usage            :
**
** Returns          : 1 if the data value is missing or zero if the values exist.
**
** Algorithms       : None.
**
** Revision History :
**
**    Author            Date       Description
**    Fred Soltanshahi  05/21/96   Original version.
**
*/

static int IsItMissingAValue(index1,index2,currentInput)
int index1 ;
int index2 ;
{
    int Y_01_Offset = 0 ;
    int Y_02_Offset = 0 ;
    int k ;

    for ( k = 0 ; k < currentInput ; k++ )
    {
        Y_01_Offset += Y_01_Length[k] ;
        Y_02_Offset += Y_02_Length[k] ;
    }

    if ( ( Y_01[index1 + Y_01_Offset] == NULL ) ||
         ( Y_02[index2 + Y_02_Offset] == NULL ) ||
         ( X_01[index1 + Y_01_Offset] == NULL ) ||
         ( X_02[index2 + Y_02_Offset] == NULL ) )
        return 1 ;
    return 0 ;
}




static int GetNextLine( FILE *filePointer,FILE *fingerfp,int *pCard,int **pFP,
                        unsigned char **pXP, int index
#ifdef OBSOLETE_IS_OK
                        , float **rangeValues, int **oneOfValues
#endif
                        )
{
    char *line, *fpcard, *fp, *CTOPS ;
    int words, hold;
    int pos ;

    if (-1 == UTL_SCAN_GETS( filePointer, "W", &line))
        goto AddTraceback ;
#ifdef OBSOLETE_IS_OK
    ReadLineAttributes(line,
                       NumRangeFields,
                       rangeValues,
                       RangeFields,
                       NumOneOfFields,
                       oneOfValues,
                       OneOfValues) ;
#endif
    /*CTOPS = strstr(line,"CTOPS=") + strlen("CTOPS="); */
    CTOPS = strstr(line,"CTOPS=") ;
    if (!(*pFP = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint)))
        goto AddTraceback ;
    if (!UTL_FILE_FREAD(pCard,sizeof(int), 1 ,fingerfp))
        goto AddTraceback ;
    if (!UTL_FILE_FREAD( *pFP ,sizeof(int), WordsPerFingerprint ,fingerfp))
        goto AddTraceback ;
    if ( CTOPS )
    {
        CTOPS += strlen("CTOPS=");
        UTL_SCAN_TOKENIZE(CTOPS,';','\\');
        UTL_SCAN_TOKENIZE(CTOPS,'>','\\');
        words = strlen(CTOPS) / 2; /* must have 8 bit bytes */
        if (!XytesPerFingerPrint[index])
            { XytesPerFingerPrint[index] = words; }
        if ( words != XytesPerFingerPrint[index]) goto MissingValue;
        *pXP = (unsigned char *) UTL_MEM_ALLOC (words);
        for(words=0;words<XytesPerFingerPrint[index];words++)
        {
            /* decode one pair of hex digits into one byte */
            memcpy (next2 , CTOPS , 2) ;
            CTOPS += 2;
            sscanf(next2, "%2x" , &hold);
            *(*pXP + words) = (unsigned char ) hold;
        }
    }
    return 1;
MissingValue :
    *pCard = 0 ;
    *pFP = (int *)NULL ;
    *pXP = (unsigned char *)NULL ;
    return 1 ;
AddTraceback :
    return 0 ;
}
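
GetNextLine converts the CTOPS field from a string of hexadecimal digits into bytes, two digits per byte. A self-contained sketch of that decoding step is shown below. The function name decode_hex_pairs and its error conventions are assumptions made for illustration. The listing itself does not validate the digits or check the return value of sscanf.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Decode a string of hex digit pairs into bytes.
   Returns the number of bytes written, or -1 if the string has odd length,
   contains a non-hex character, or does not fit in out_cap bytes. */
static int decode_hex_pairs(const char *hex, unsigned char *out, size_t out_cap)
{
    size_t len = strlen(hex);
    size_t i;

    if (len % 2 != 0 || len / 2 > out_cap)
        return -1;
    for (i = 0; i < len / 2; i++) {
        char pair[3];
        unsigned int value;

        pair[0] = hex[2 * i];
        pair[1] = hex[2 * i + 1];
        pair[2] = '\0';
        if (!isxdigit((unsigned char)pair[0]) || !isxdigit((unsigned char)pair[1]))
            return -1;
        if (sscanf(pair, "%2x", &value) != 1)
            return -1;
        out[i] = (unsigned char)value;
    }
    return (int)(len / 2);
}
```

Each decoded byte then carries two 4-bit bin codes, the low nibble and the high nibble, which is the layout IntersectQuery reads.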

static int not_here( what, nbytes )
unsigned char *what;
int nbytes;
{
    /* invert every bit in the buffer */
    for ( ; nbytes; --nbytes, what++) *what = ~*what;
    return 1;
}

/* this belongs in the utl module, actually */
int MakeComLine( char *line, int len, int argc, char **argv)
{
    int i;

    sprintf(line, "%s ",argv[0]);
    for(i=1;i<argc;i++)
    {
        line += strlen(line);
        sprintf(line,"%s ",argv[i]);
    }
}
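
MakeComLine receives the buffer length in len but never consults it, so a long command line could overrun the buffer. A bounded variant might look like the sketch below. The name join_args and the return convention are assumptions for illustration, and snprintf is a C99 function that may not have been available to the original authors.

```c
#include <stdio.h>
#include <string.h>

/* Join argv[0..argc-1] into line, each argument followed by one space.
   Returns 0 on success, or -1 if the result does not fit in cap bytes. */
static int join_args(char *line, size_t cap, int argc, char **argv)
{
    size_t used = 0;
    int i;

    if (cap == 0)
        return -1;
    line[0] = '\0';
    for (i = 0; i < argc; i++) {
        int n = snprintf(line + used, cap - used, "%s ", argv[i]);

        /* snprintf reports the length it wanted to write; treat any
           shortfall as an error instead of silently truncating. */
        if (n < 0 || (size_t)n >= cap - used)
            return -1;
        used += (size_t)n;
    }
    return 0;
}
```

The caller can then decide whether a truncated command line should abort the checkpoint or merely be logged.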



static void UserHitControlC()
/*+!
 *
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 *
 *    Author        Date        Description
 *    G. B. Smith   02-09-93    Original Version
 *
 */
{
    UserAborted = 1;
}
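
The handler above only records that an interrupt arrived, and the long-running loops poll the flag (see the UserAborted tests in ZapAllNeighbors). A minimal standard C sketch of this pattern follows. The names user_aborted, on_interrupt and install_interrupt_handler are illustrative, and declaring the flag as volatile sig_atomic_t is the portable form of the idea rather than what the listing does (it uses a plain int).

```c
#include <signal.h>

/* Flag set by the handler and polled by the main loops. */
static volatile sig_atomic_t user_aborted = 0;

/* Signal handler: record the request and return immediately. */
static void on_interrupt(int signum)
{
    (void)signum;
    user_aborted = 1;
}

/* Install the handler for SIGINT. Returns 0 on success, -1 on failure. */
static int install_interrupt_handler(void)
{
    return signal(SIGINT, on_interrupt) == SIG_ERR ? -1 : 0;
}
```

Keeping the handler this small means the selection loops can finish their current step and still write a checkpoint before exiting.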



static int ReadEverything()
{
    char *hold;
    char buff[255];
    int i;
    int j;
    char *cp ;
    char *mp ;
    int offset ;
    int size ;
    char *input ;
    FILE *fp ;
    char *line ;
    void *bitset[MAX_INPUT_CSLNS] ;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */
    offset=0;
    if (!WarmUp() || !WhatsTheDifference()) return 0;

    if ( MasterFileList || BitsetFileList )
    {
        if ( ! ( fp = fopen(MasterFileList?MasterFileList:BitsetFileList,"r")) )
            return 0 ;
        TotalInputs = 0 ;
        while ( UTL_SCAN_GETS( fp, "W", &line) != -1 )
        {
            strcpy (buff, line);
            cp = strtok(buff," ");
            InputNames[TotalInputs] = UTL_STR_SAVE(cp);
            cp = strtok(NULL," ");
            InputStartRec[TotalInputs] = atoi(cp);
            cp = strtok(NULL," ");
            OutputCheckpointNames[TotalInputs++] = UTL_STR_SAVE(cp);
        }
    }

    else
    {
        if ( !MasterFile && !BitsetSource && !MasterFileList )
        {
            fprintf(stderr,"An input file(master or bitset) must be specified\n");
            return 0 ;
        }
        if ( MasterFile && BitsetSource )
        {
            fprintf(stderr,"A design run can be run from either a master or a bitset file\n");
            return 0 ;
        }
        if (MasterFile && !MasterRecord )
        {
            fprintf(stderr,"A Bitset (or Master) record number must be specified\n");
            return 0 ;
        }
        /*
        ** Special case where we want to process all the records in the
        ** master file.
        */
        if ( atoi(MasterRecord) == -1 )
        {
            if ( ( TotalInputs = CountMasterRecords(MasterFile)) == 0)
                goto UnableToReadMaster ;
            for ( i = 0 ; i < TotalInputs ; i++ )
            {
                InputNames[i] = UTL_STR_SAVE(MasterFile) ;
                InputStartRec[i] = i+1 ;
            }
        }
        else
        {
            if ( MasterFile )
                input = UTL_STR_SAVE(MasterFile);
            else
                input = UTL_STR_SAVE(BitsetSource);
            /*
            ** If there are more than one input file, process them all.
            */
            cp = strtok(input," ");
            while ( cp )
            {
                InputNames[TotalInputs++] = UTL_STR_SAVE(cp);
                cp = strtok(NULL," ");
            }
            mp = strtok(MasterRecord," ");
            for ( i = 0 ; i < TotalInputs ; i++ )
            {
                /*
                ** If the user specified record numbers for all the master files, then use them
                ** otherwise we will use the first record.
                */
                if ( mp )
                {
                    InputStartRec[i] = atoi(mp);
                    mp = strtok(NULL," ");
                }
                else
                    InputStartRec[i] = 1 ;
            }
        }
        mp = strtok(CheckPointFileName," ");
        for ( i = 0 ; i < TotalInputs ; i++ )
        {
            if ( mp )
            {
                OutputCheckpointNames[i] = UTL_STR_SAVE(mp);
                mp = strtok(NULL," ");
            }
            else
            {
                sprintf(buff,"%s_%d_chk.bs",basename(InputNames[i],NULL),i) ;
                OutputCheckpointNames[i] = UTL_STR_SAVE(buff);
            }
        }
    }

    nY_01 = nY_02 = 0 ;
    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        if ( MasterFile || MasterFileList )
        {
            if ( !RetrieveMasterFile(InputNames[i],
                                     MasterFile_File,
                                     InputStartRec[i],
                                     &(NumMissingBits[i]),
                                     &(BitsInAbsentiaNoCount[i]),
                                     &(CoreFileNames[i]),
                                     &(CoreStart[i]),
                                     &FngrFile,
                                     &(X1file[i]),
                                     &(X2file[i]),
                                     &(Y_01_Length[i]),
                                     &(Y_02_Length[i]),
                                     &(fingerFP[i]),
                                     &(fingerOffsets[i]),
                                     &ScreenFileName,
                                     &BytesPerFingerPrint,
                                     &WordsPerFingerprint,
                                     &query,
                                     &FingerCore_FP,
                                     &FingerCore_Card) )
                goto UnableToReadMaster ;
            RemainingInput[i] = Y_01_Length[i] * Y_02_Length[i];
        }
        else
        {
            if(!(bitset[i] = CS_PRDCT_BITSET_OPEN(InputNames[i],
                                                  InputStartRec[i])) )
                goto UnableToReadBitset ;
            if ( !RetrieveMasterFileFromBitset(bitset[i],
                                     &(MasterFile_Bitset[i]),
                                     &(StartRec_Bitset[i]),
                                     &(NumMissingBits[i]),
                                     &(BitsInAbsentiaNoCount[i]),
                                     &(CoreFileNames[i]),
                                     &(CoreStart[i]),
                                     &FngrFile,
                                     &(X1file[i]),
                                     &(X2file[i]),
                                     &(Y_01_Length[i]),
                                     &(Y_02_Length[i]),
                                     &(fingerFP[i]),
                                     &(fingerOffsets[i]),
                                     &ScreenFileName,
                                     &BytesPerFingerPrint,
                                     &WordsPerFingerprint,
                                     &query,
                                     &FingerCore_FP,
                                     &FingerCore_Card) )
                goto UnableToReadBitset ;
            RemainingInput[i] = CS_PRDCT_BITSET_SELECTED(bitset[i]);
        }
        nY_01 += Y_01_Length[i] ;
        nY_02 += Y_02_Length[i] ;
    }

    if (!(Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(X_01 = (unsigned char **)
                 UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(iX_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01)))
        goto UnableToAllocateMemory ;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if(!(RangeValues_Y01 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_01)))
            goto UnableToAllocateMemory ;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
            goto UnableToAllocateMemory ;
    }
#endif
    /*
    ** Read all the values for the X1 file.
    */

    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        if ( !(fileHandles[j] = fopen(X1file[j],"r")) )
            goto UnableToOpenXlFile ;
        for (i=0;i<Y_01_Length[j];i++)
        {
            if (! GetNextLine( fileHandles[j],
                               fingerFP[j],
                               cY_01 + i + offset ,
                               Y_01 + i + offset ,
                               X_01 + i + offset ,
                               0
#ifdef OBSOLETE_IS_OK
                               ,RangeValues_Y01 + i + offset ,
                               OneOfValues_Y01 + i + offset
#endif
                               )) return 0;
        }
        offset += Y_01_Length[j] ;
        fclose(fileHandles[j]) ;
    }

    if (!(Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(X_02 = (unsigned char **)
                 UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(iX_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02)))
        goto UnableToAllocateMemory ;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if(!(RangeValues_Y02 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_02)))
            goto UnableToAllocateMemory ;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
            goto UnableToAllocateMemory ;
    }
#endif
    offset = 0 ;
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        if ( !(fileHandles[j] = fopen(X2file[j],"r")) )
            goto UnableToAllocateMemory ;
        for (i=0;i<Y_02_Length[j];i++)
        {
            if (! GetNextLine( fileHandles[j],
                               fingerFP[j],
                               cY_02 + i + offset ,
                               Y_02 + i + offset ,
                               X_02 + i + offset ,
                               1
#ifdef OBSOLETE_IS_OK
                               ,RangeValues_Y02 + i + offset ,
                               OneOfValues_Y02 + i + offset
#endif
                               )) return 0;
        }
        offset += Y_02_Length[j] ;
        fclose(fileHandles[j]);
    }

    if (!Good_1) /* note: Good_1 is never used but triggers other allocations */
    {
        i= (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        i= (nY_02+31)/32 * 4;
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);

        for ( size = 0 , j = 0 ; j < TotalInputs ; j++ )
        {   BitOffsets[j] = size;
            size += ( Y_01_Length[j] * Y_02_Length[j] ) ;
        }
        Pro_size = size = ( size + 31 )/32 * 4 ;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Good_Products,0,size);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Dead_Products,0,size);
        if ( !( MasterFile || MasterFileList ) ) /* gather the dead together.... */
            Mortuary(bitset, TotalInputs, Dead_Products, size, BitOffsets);
    }
    offset = SomeLeft = 0 ;
    /*
    ** Figure out the number of products for each input set and the total
    ** number of products.
    */
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        BitMapStartPoint[j] = offset ;
        offset += Y_01_Length[j] * Y_02_Length[j] ;
        SomeLeft += RemainingInput[j];
    }
    TotalProducts = offset ;
#ifdef OBSOLETE_IS_OK
    /*
    ** Initialize the needed structures to pass around.
    */
    RangeValuesData.numRangeFields = NumRangeFields ;
    RangeValuesData.rangeValues_Y01 = RangeValues_Y01 ;
    RangeValuesData.rangeValues_Y02 = RangeValues_Y02 ;
    RangeValuesData.rangeFields = RangeFields ;
    OneOfValuesData.numOneOfFields = NumOneOfFields ;
    OneOfValuesData.oneOfValues_Y01 = OneOfValues_Y01 ;
    OneOfValuesData.oneOfValues_Y02 = OneOfValues_Y02 ;
    OneOfValuesData.oneOfFields = OneOfValues ;
#endif
    InputData.totalInputs = TotalInputs ;
    InputData.Y_01_Length = Y_01_Length ;
    InputData.Y_02_Length = Y_02_Length ;
    /*
    ** Read in the -rangevar values if they are present in the csln file.
    */
#ifdef OBSOLETE_IS_OK
    if( !ReadRangeVarFromCoreFiles(TotalInputs,
                                   CoreFileNames,
                                   CoreStart,
                                   NumRangeFields,
                                   RangeFields) )
        return 0 ;
#endif
    return 1;
UnableToOpenXlFile :
    fprintf(stderr,"Unable to open reagent file\n");
    goto AddTraceback ;
UnableToAllocateMemory :
    fprintf(stderr, "Unable to allocate memory\n");
    goto AddTraceback ;
UnableToReadBitset :
    fprintf(stderr, "Unable to Read bitset file\n");
    goto AddTraceback ;
UnableToReadMaster :
    fprintf(stderr, "Unable to Read master file\n");
    goto AddTraceback ;
AddTraceback :
    return 0 ;
}

/* concatenate a series of compressed bitsets into one big raw bitset
   -> AND <- destroy those compressed bitsets */
int Mortuary(void *bitset[], int nsets, int *rawbits, int byte_size, int *offset)
{
    int i ;

    for (i=0; i< nsets; i++)
    {   CS_PRDCT_BITSET_CONCAT_RAW( bitset[i], rawbits, offset[i], 0);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(bitset[i]) ;
        bitset[i] = NULL;
    }
    not_here( rawbits,byte_size );
}
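
Throughout the listing, every product of reagent i and reagent j from input set k occupies one bit at position BitMapStartPoint[k] + i * Y_02_Length[k] + j in a single flat bitmap, and Mortuary finishes by inverting the concatenated bitmap. The sketch below illustrates that layout under stated assumptions. The helper names are illustrative, and the bit order within a byte (least significant bit first) is an assumption, since the TestBit and FlagProduct sources are not shown.

```c
#include <limits.h>
#include <stddef.h>

/* Flat bit position of product (i, j) in an input set that starts at
   bit `start` and has n2 second-position reagents. */
static size_t product_position(size_t start, size_t i, size_t j, size_t n2)
{
    return start + i * n2 + j;
}

/* Return 1 if the bit at pos is set, else 0. */
static int test_bit(const unsigned char *bits, size_t pos)
{
    return (bits[pos / CHAR_BIT] >> (pos % CHAR_BIT)) & 1;
}

/* Set the bit at pos. */
static void set_bit(unsigned char *bits, size_t pos)
{
    bits[pos / CHAR_BIT] |= (unsigned char)(1u << (pos % CHAR_BIT));
}

/* Invert every bit, turning a bitmap of one state into its complement. */
static void invert_bits(unsigned char *bits, size_t nbytes)
{
    size_t i;

    for (i = 0; i < nbytes; i++)
        bits[i] = (unsigned char)~bits[i];
}
```

Storing one bit per product keeps the bookkeeping compact even when the number of products (the product of the two reagent list lengths) is large.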

static int ParseArguments( argc, argv )
/*+!
 *
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 *    Author        Date        Description
 *    G. B. Smith   02-09-93    Original Version
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    OutputFile = stdout;
    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    if (WhatFirst)
    {   if (strstr(WhatFirst,"R1")) WhatFirst[0] = '1';
        if (strstr(WhatFirst,"R2")) WhatFirst[0] = '2';
    } else {
        WhatFirst=UTL_MEM_ALLOC(2); WhatFirst[0] = '0'; }
#ifdef OBSOLETE_IS_OK
    if ( RangeVar &&
         !ParseRangeVar(RangeVar,&NumRangeFieldsAllocated,&NumRangeFields,&RangeFields))
        goto SyntaxError ;
    if ( OneOfVar &&
         !ParseOneOfVar(OneOfVar,&NumOneOfFieldsAllocated,&NumOneOfFields,&OneOfValues))
        goto SyntaxError ;
#endif
    return 1;
SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*
 * Returns: 1 on success, else 0
 *
 */
{
    char *msg;
    FILE *fp;

    OutputFile = stdout;
    if( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) the file to output to exists
            **     Use the ownership of the current owner of the file or if you can't do that
            **     do not do anything.
            ** (2) The file is being created.
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            {   /* If the file exists and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            {   /* Create the file as the REAL user */
                seteuid(uid);
            }
        }

        OutputFile = fopen( OutputFileName, "wb");
        if( !OutputFile ) {
            fprintf(stderr, "Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}



static CloseOutputFile()
/*+!
 *
 * This function closes the output file. It is included just for cleanliness.
 *
 *    Author        Date        Description
 *    G. B. Smith   02-09-93    Original Version
 *
 */
{
    fclose( OutputFile );
}

CheckPointProgram(programName)
char *programName ;
{
    int sizes[2] ;
    int allocSizes[2] ;
    int numInSites[2] ;
    char hold[81] ;
    int i ;
    void *compressed ;
    int total ;

    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        sizes[0] = Y_01_Length[i] ;
        sizes[1] = Y_02_Length[i] ;
        numInSites[0] = numInSites[1] = -1 ;
        allocSizes[0] = allocSizes[1] = -1 ;
        /*
        ** Let's get a compressed version of the dead products before we write it out
        ** to file.
        */
        compressed = CS_PRDCT_BITSET_CREATE_BIT_STRING(
                         Dead_Products,
                         BitMapStartPoint[i],
                         sizes,
                         sizes,
                         &total);
        WriteOutCheckPointFile(OutputCheckpointNames[i],
                         ( MasterFile || MasterFileList ) ? InputNames[i]
                                                          : MasterFile_Bitset[i],
                         ( MasterFile || MasterFileList ) ? InputStartRec[i]
                                                          : StartRec_Bitset[i],
                         programName,
                         Good_Products,
                         BitMapStartPoint[i],
                         2,
                         sizes,
                         allocSizes,
                         Selections[i],
                         numInSites,
                         total,
                         compressed);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(compressed);
    }
}

int main( argc, argv )
/*+E
 *
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
    int numFiltered ;
    int numEliminated ;
    int tmp ;
    char comline[2048];

    /***
    *** Establish handler for a user interrupt.
    ***/
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if( !OpenOutputFile() ) goto FailureExit;
    /* if (!RestartState()) goto FailureExit; */
    time( &startTime );
    Visual((stderr, "Begin reading files: %s",ctime(&startTime)));
    /* Let's actually do something now */
    if (!ReadEverything(NoMorehitsPlease)) goto FailureExit;
    time( &finishTime );
    Visual((stderr, "Begin filtering: %s",ctime(&finishTime)));
    if (!FilterProducts(&InputData,&RangeValuesData,&OneOfValuesData,&numFiltered,IsItMissingAValue))
        goto FailureExit;
    CurrentInput = 0 ;
    time( &finishTime );
    Visual((stderr,"Filtered out %d out of %d possible products\n", numFiltered,
            TotalProducts ));

#if 0
    /*
    ** Now see if there are any hitlists or databases that you should filter
    ** for.
    */
    time( &finishTime );
    Visual((stderr, "Begin eliminating selections in Unity database %s",ctime(&finishTime)));
    if ( !EliminateProductsFromDatabase(DatabaseNames,
                                        &numEliminated,
                                        ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    time( &finishTime );
    Visual((stderr,"Begin eliminating selections in Unity hitlist %s",ctime(&finishTime)));
    if ( !EliminateProductsFromHitlist(HitlistNames,
                                       ScreenFileName?ScreenFileName:DefaultScreenFileName,
                                       &tmp,
                                       ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    Visual((stderr, "Eliminated %d out of %d possible products\n",numEliminated+tmp,
            TotalProducts ));
#endif
    Visual((stderr, "Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted &&
        !SelectEverything(InputSource,
                          NoMorehitsPlease,
                          WhatFirst,
                          CalcualteProductFingurePrint,
                          ActuallyCompute,
                          ZapAllNeighbors)) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );
    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;
    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));
    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));
    MakeComLine(comline, 2048, argc, argv);
    CheckPointProgram(comline) ;
    UserAborted ? exit(ErrorExit) : exit(GoodExit);

SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}
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Appendix "L" 

/* dbcsln_both */
/* differs from dbcsln_design ONLY in combining topomers and fp as 1 metric */
/* combined as D^2 = (Ratio * CoMFA)^2 + (1-Tanimoto)^2 */

/*+C
 *
 * This program evaluates (approximate) Topomer+Tanimoto similarity vs cSLNs
 * based on preprocessing of the substituent reagents. Using this, it
 * selects a diverse set of products while trying to maximize use of
 * some groups. Diversity is achieved by zapping all neighbors after each
 * new selection, so that any non-zapped product can freely be selected.
 *
 * To be added: restart capability and reagent blackout
 * (i.e. to recomplete an earlier design and/or to remove
 * all occurrences of Y_01 = 37 and so on when they
 * prove to be unavailable or otherwise unsuitable).
 * Limitations: currently exactly 2 R groups are assumed. Need to extend
 * to more than 2 and to handle X groups.
 *
 * The OBSOLETE file contains one line per hit, of the form
 *     Y1 Y2
 * where Y1 = index of the substituent in X1.pro file
 *       Y2 = index of the substituent in X2.pro file
 *
 * The REAL output is a ChemSpace bitset file.
 *
 * Options: Look at the array Options below.
 *
 */
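
The header comment defines the combined metric as D^2 = (Ratio * CoMFA)^2 + (1 - Tanimoto)^2, where Ratio rescales the topomer (CoMFA) distance so that it is commensurate with the fingerprint dissimilarity. A minimal sketch of that formula follows. The function name combined_distance2 and its parameter names are illustrative and are not taken from the listing.

```c
/* Squared combined distance: the scaled topomer distance and the fingerprint
   dissimilarity (1 - Tanimoto) are treated as two orthogonal components. */
static double combined_distance2(double ratio, double topomer_distance,
                                 double tanimoto)
{
    double shape_term = ratio * topomer_distance;
    double fingerprint_term = 1.0 - tanimoto;

    return shape_term * shape_term + fingerprint_term * fingerprint_term;
}
```

With the default Ratio of 0.003 declared in this listing, a topomer distance of 100 contributes 0.3 to the distance before squaring, while a Tanimoto similarity of 0.85 contributes 0.15.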

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_print.h"
#include "commonData.h" /* Globals used by most functions, we will clean this
                           up soon */
#include "dbcsln_bs_proto.h"
#include "dbcsln_hlm_proto.h"
#define OBSOLETE_IS_OK 1
FILE *debugFile = (FILE *) NULL ;



#ifdef OBSOLETE_IS_OK
/* these sections retain the filtering capabilities now also present
   in db filter.c -- at some point they should exist ONLY in db_
*/
static struct RangeInfoStruct RangeValuesData ;
static struct OneOfInfoStruct OneOfValuesData ;
static struct InputInfoStruct InputData ;
static int NumRangeFields ;
static int NumRangeFieldsAllocated ;
static RangeStruct *RangeFields ;
static int NumOneOfFieldsAllocated ;
static int NumOneOfFields ;
static OneOfStruct *OneOfValues ;
static float **RangeValues_Y01 ; /* Actual values read in from nnn.X1 file.
                                    If MW is the first and logp is the second value
                                    specified on the -rangevar argument list then
                                    RangeValues_Y01[n][0] would keep the value for MW
                                    for the nth line in the nnn.X1 file and
                                    RangeValues_Y01[n][1] would keep the value for
                                    logp for that line */
static float **RangeValues_Y02 ; /* same */
static int **OneOfValues_Y01 ; /* Actual values read from nnn.X1 files but translated
                                  into an index of OneOfValues[i].values so
                                  we don't have to waste memory and time doing strcmp */
static int **OneOfValues_Y02 ; /* Same */
#endif



static char *MasterFile ;
static char *MasterRecord ;
static FILE *MasterFile_File;
static char *FngrFile;
static int FingerCore_Card;
static int *FingerCore_FP;

static char *RangeVar ;
static char *OneOfVar ;
static double Ratio = 0.003 ;

static int WordsPerFingerprint = 0;
static int BytesPerFingerPrint = 0;
static int NoMorehitsPlease = 999999999;
static int DebugLevel;
static int UserAborted;
static char *OutputFileName;
static char *CheckPointFileName;
static char *WhatFirst;
static char *InputSource = 0;
static char *BitsetSource = 0;
static char *DatabaseNames = (char *)0 ;
static char *HitlistNames = (char *)0 ;
static int BitOffsets[MAX_INPUT_CSLNS]; /* why recompute? */
static int CoreSym[MAX_INPUT_CSLNS];
static double *jX_01, *jX_02;
int TotalProducts;
static int Pro_size;

static struct ParseOptions Options[] = {
/***
    DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
***/
    {"master", ParseOptString, &MasterFile,
        "Name is the file with master file records" },
    {"index", ParseOptString, &MasterRecord,
        "Which MasterRecord or Bitset entry 1-n" },
    {"comfa", ParseOptDouble, &Ratio,
        "Weighting for CoMFA fields (0.003)" },
    {"distance", ParseOptDouble, &Distance,
        "Weighted neighborhood distance (0.240)" },
    {"maxhits", ParseOptInt, &NoMorehitsPlease,
        "Maximum number of hits before stopping" },
    {"bitset", ParseOptString, &BitsetSource,
        "Bitset file to start from" },
    {"output", ParseOptString, &OutputFileName,
        "File to which hit info will be written." },
    {"checkpoint", ParseOptString, &CheckPointFileName,
        "File to which bitset info will be written." },
    {"prefer", ParseOptString, &WhatFirst,
        "One of R1, R2 to maximize use of." },
    {"debug", ParseOptBoolean, &DebugLevel,
        "Use +debug to enable debugging messages" },
#ifdef OBSOLETE_IS_OK
    {"rangevar", ParseOptString, &RangeVar,
        "Scalar field name and range to filter out, i.e. logp -1.0 8.0 MW 200 500 price 0 12.50" },
    {"oneof", ParseOptString, &OneOfVar,
        "Field name and list of values that the product should match\n, i.e. supplier Aldrich,Sigma,Fluka,SALOR taste SWEET,Salty" },
#endif
    {"database", ParseOptString, &DatabaseNames,
        "Unity database to use to exclude possible products" },
    {"hitlist", ParseOptString, &HitlistNames,
        "Unity hitlist to use to exclude possible products" },
};


static int WarmUp()
{
    int i;
    for (i=0;i<65536;i++) BigBits[i] = (i&1) + (i&2)/2 + (i&4)/4 + (i&8)/8 +
                          (i&16)/16 + (i&32)/32 + (i&64)/64 + (i&128)/128
                          + (i&256)/256 + (i&512)/512 + (i&1024)/1024
                          + (i&2048)/2048
                          + (i&4096)/4096 + (i&8192)/8192 + (i&16384)/16384
                          + (i&32768)/32768 ;
    setbits_nbits_Init();
    return 1;
}
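
WarmUp fills a 65536-entry table so that the number of set bits in any 16-bit word can later be read with a single lookup. The following is a minimal sketch of the same idea, not the listing's code: the names bit_count16, init_bit_count16 and count_bits are hypothetical, and the table is built incrementally instead of with sixteen mask-and-divide terms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the BigBits table: the number of set bits
   in every possible 16-bit value. */
static unsigned char bit_count16[65536];

/* Build the table incrementally: the count for i equals the count for
   i >> 1 plus the lowest bit of i. */
static void init_bit_count16(void)
{
    unsigned int i;
    bit_count16[0] = 0;
    for (i = 1; i < 65536u; i++)
        bit_count16[i] = (unsigned char)(bit_count16[i >> 1] + (i & 1u));
}

/* Count the set bits of a fingerprint stored as nwords 16-bit words. */
static int count_bits(const unsigned short *words, size_t nwords)
{
    size_t k;
    int total = 0;
    assert(words != NULL);
    for (k = 0; k < nwords; k++)
        total += bit_count16[words[k] & 0xFFFFu];
    return total;
}
```

init_bit_count16 has to run once before count_bits is used, which mirrors the way WarmUp is called before any fingerprint comparison.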

static int WhatsTheDifference()
{
    int i, j;
#define pow2(a) ( (a) * (a) )
    /* the assignment of codes is based on the following (from gen_pls.c):
       static fpt cutoff[16] = {9999., 0., 2., 4., 6., 8., 10., 12.,
                                14., 16., 18., 20., 22., 24., 26., 30. };
    */
    boundary[0] = 9999.; /* missing data ought never to occur. */
    boundary[1] = -0.1 * Ratio;
    for(i=2;i<15;i++)
        boundary[i] = (2*i-3) * Ratio;
    boundary[15] = 30.0 * Ratio; /* this is a steep curve with a cutoff at 30! */
    for (i=0;i<16;i++) for (j=0;j<16;j++)
        Dist[i][j] = pow2( boundary[i] - boundary[j]);
    Distance *= Distance; /* want to test D^2 directly */
    return 1;
}
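
WhatsTheDifference precomputes the squared difference between every pair of the 16 field codes, so a field comparison later costs one table lookup per code. A minimal sketch of that precomputation follows; build_dist2, NUM_CODES and the parameter names are hypothetical, and the arrays are passed in rather than kept as globals.

```c
#include <assert.h>

#define NUM_CODES 16

/* Fill boundary[] with one representative field value per 4-bit code and
   dist2[][] with the squared difference for every pair of codes.
   ratio plays the role of the CoMFA weighting (0.003 by default in the
   listing). */
static void build_dist2(double ratio, double boundary[NUM_CODES],
                        double dist2[NUM_CODES][NUM_CODES])
{
    int i, j;
    assert(ratio > 0.0);
    boundary[0] = 9999.0;                 /* code 0: missing data */
    boundary[1] = -0.1 * ratio;
    for (i = 2; i < 15; i++)
        boundary[i] = (2 * i - 3) * ratio;
    boundary[15] = 30.0 * ratio;          /* steep cutoff at 30 */
    for (i = 0; i < NUM_CODES; i++)
        for (j = 0; j < NUM_CODES; j++) {
            double d = boundary[i] - boundary[j];
            dist2[i][j] = d * d;
        }
}
```

Because the table holds squared differences, sums of its entries can be compared directly against the squared neighborhood radius, which is why the listing squares Distance once at start-up.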

static int CalcualteProductFingurePrint(product,firstPart,secondPart)
int *product ;
int *firstPart ;
int *secondPart ;
{
    int index ;
    int totalBitsSet = 0 ;
    unsigned char *prod , *y01, *y02 ;

    prod = ( unsigned char *)product ;
    y01 = ( unsigned char *)firstPart ;
    y02 = ( unsigned char *)secondPart ;
    for (index = 0; index < BytesPerFingerPrint; index++, prod++)
    {
        *prod = *y01++ | *y02++ ;
        totalBitsSet += nbits[*prod & 255];
    }
    return totalBitsSet ;
}

static int IntersectQuery( pIntr, pFP, pXntr, pXP, xuery, index,
                           symmetric, yuery, pXntr2)
int *pIntr, **pFP;
double *pXntr, *pXntr2;
unsigned char **pXP, **xuery, **yuery;
int index, symmetric;
{
    unsigned char *ptr, *qtr;
    int i, count;
    double xount;

    if ( !(*pFP) || !(*pXP) )
        return 1 ;
    ptr = (unsigned char *) *pFP;
    qtr = (unsigned char *) query;
    for(count=0, i=0; i<WordsPerFingerprint*4; i++)
        count += nbits[ *ptr++ & *qtr++];
    *pIntr = count;
    if ( xuery )
    {
        ptr = (unsigned char *) *pXP;
        qtr = (unsigned char *) *xuery;
        for(xount=0.0, i=0; i<XytesPerFingerPrint[index]; i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                   + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
        *pXntr = xount;
        if ( !symmetric) return 1;
        ptr = (unsigned char *) *pXP;
        qtr = (unsigned char *) *yuery;
        for(xount=0.0, i=0; i<XytesPerFingerPrint[index]; i++, ptr++, qtr++)
            xount += Dist[ *ptr & 0x0F ][ *qtr & 0x0F ]
                   + Dist[ (*ptr & 0xF0) >> 4][ (*qtr & 0xF0) >> 4] ;
        *pXntr2 = xount ;
    }
    return 1;
}

static int ActuallyCompute( index1, index2, pUnion, pIntersection, pMaxTan, currentInput)
int index1, index2, *pUnion, *pIntersection;
double *pMaxTan;
{
    int i;
    unsigned short *h1, *h2, *hquery, product;
    int numberOfMissingBits ;

    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0] ;
    else
        numberOfMissingBits = NumMissingBits[currentInput] ;
    h1 = (unsigned short *) Y_01[index1];
    h2 = (unsigned short *) Y_02[index2];
    hquery = (unsigned short *) query;
    *pUnion = *pIntersection = 0;
    for(i=0; i<WordsPerFingerprint*2; i++, h1++, h2++, hquery++)
    {
        /* product = (*h1 | *h2) ;*/
        *pUnion += BigBits[ (*h1 | *h2) | *hquery];
        *pIntersection += BigBits[ (*h1 | *h2) & *hquery];
    }
    *pMaxTan = (double) (*pIntersection + numberOfMissingBits )/ (double) *pUnion;
    return 1;
}
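
ActuallyCompute forms the product fingerprint as the bitwise OR of the two substituent fingerprints and compares it with the query by counting union and intersection bits. The sketch below shows that Tanimoto calculation in isolation. It is a simplified illustration under stated assumptions: product_tanimoto and bits16 are hypothetical names, the bit count is done with a loop instead of the lookup table, and the missing-bits correction that the listing adds to the intersection is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Portable bit count for one 16-bit word. */
static int bits16(unsigned int w)
{
    int n = 0;
    for (w &= 0xFFFFu; w != 0; w &= w - 1)
        n++;
    return n;
}

/* Tanimoto similarity between the product fingerprint (a | b) and a
   query fingerprint, all stored as nwords 16-bit words.
   Returns 0.0 when no bit is set in either fingerprint. */
static double product_tanimoto(const unsigned short *a,
                               const unsigned short *b,
                               const unsigned short *query, size_t nwords)
{
    size_t k;
    int uni = 0, inter = 0;
    assert(a != NULL && b != NULL && query != NULL);
    for (k = 0; k < nwords; k++) {
        unsigned int prod = (unsigned int)(a[k] | b[k]);
        uni   += bits16(prod | query[k]);
        inter += bits16(prod & query[k]);
    }
    return uni ? (double)inter / (double)uni : 0.0;
}
```

Guarding the division is a choice made for this sketch; the listing divides by the union count directly.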



static int
ZapAllNeighbors(thisQuery,thisC_Query,numZapped,doCTOPS,index1,index2,currentInput)
int *thisQuery ;
int thisC_Query ;
int *numZapped ;
int doCTOPS ;
int currentInput ;
{
    int cqt, q_lo, i, j, carhold, inthold, onion, intsc ;
    double max, test, test2;
    int k ;
    int Y_01_Offset, Y_02_Offset ;
    int pos ;
    int numberOfMissingBits ;

    if (DebugLevel == 69)
        printf(" time to zap %d - %d\n", index1+1, index2+1);
    if ( currentInput == -1 )
        numberOfMissingBits = NumMissingBits[0] ;
    else
        numberOfMissingBits = NumMissingBits[currentInput] ;
    if ( thisQuery )
    {
        memcpy(query,thisQuery,BytesPerFingerPrint) ;
        c_query = thisC_Query ;
    }
    *numZapped = 0 ;
    Y_01_Offset = Y_02_Offset = 0 ;
    for ( k = 0 ; k < CurrentInput ; k++ )
    {
        Y_01_Offset += Y_01_Length[k];
        Y_02_Offset += Y_02_Length[k];
    }
    for (i=0;i<nY_01;i++)
        if (! IntersectQuery( iY_01+i,
                              Y_01+i,
                              iX_01+i,
                              X_01+i,
                              (doCTOPS)?X_01 + index1 + Y_01_Offset : NULL ,
                              0,
                              CoreSym[CurrentInput] ,
                              (doCTOPS)?X_02 + index2 + Y_02_Offset : NULL , jX_01+i))
            return 0;
    for (i=0;i<nY_02;i++)
        if (! IntersectQuery( iY_02+i,
                              Y_02+i,
                              iX_02+i,
                              X_02+i,
                              (doCTOPS)?X_02 + index2 + Y_02_Offset : NULL ,
                              1,
                              CoreSym[CurrentInput] ,
                              (doCTOPS)?X_01 + index1 + Y_01_Offset : NULL, jX_02+i))
            return 0;
    /* now zap topomer neighbors */
    /*
    ** Only do topomer neighbors if CTOPS was present in the input.
    */
    if ( doCTOPS )
    {
        Y_01_Offset = Y_02_Offset = 0 ;
        for ( k = 0 ; k < TotalInputs ; k++ )
        {
            for(i=0; i<Y_01_Length[k]; i++)
            {
                if ( !CoreSym[CurrentInput] && (iX_01[ i + Y_01_Offset ] > Distance) )
                    continue;
                for (j=0; j<Y_02_Length[k]; j++)
                {
                    if (UserAborted)
                        return 1;
                    switch (CoreSym[CurrentInput])
                    { case 0: if ( iX_02[j + Y_02_Offset] > Distance) continue;
                              test = iX_01[i + Y_01_Offset] +
                                     iX_02[j + Y_02_Offset] ;
                              break;
                      case 1:
                              test = iX_01[i + Y_01_Offset] + iX_02[j + Y_02_Offset];
                              test2= jX_01[i + Y_01_Offset] +
                                     jX_02[j + Y_02_Offset];
                              if (test2 < test) test=test2;
                              break;
                    }
                    if ( test <= Distance &&
                         !TestBit(Dead_Products,
                                  BitMapStartPoint[k] + i*Y_02_Length[k] + j) &&
                         Am_I_Close(i + Y_01_Offset, j + Y_02_Offset, test,
                                    currentInput) )
                    {
                        if (DebugLevel == 69)
                            printf("Distance kill %d %d - %f , %f + %f OR %f , %f + %f\n",
                                   i+1, j+1, iX_01[i] + iX_02[j], iX_01[i], iX_02[j],
                                   jX_01[i] + jX_02[j], jX_01[i], jX_02[j] );
                        pos = BitMapStartPoint[k] + i*Y_02_Length[k] + j ;
                        FlagProduct(Dead_Products, 0, 0, pos );
                        SomeLeft-- ;
                        RemainingInput[k]-- ;
                        (*numZapped)++ ;
                    }
                } /* Y_02 loop */
            } /* Y_01 loop */
            Y_01_Offset += Y_01_Length[k] ;
            Y_02_Offset += Y_02_Length[k] ;
        }
    }
}
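
ZapAllNeighbors implements the exclusion step of the design: once a product is selected, every product within the neighborhood radius is flagged dead so that it can never be selected. The sketch below shows that select-then-zap pattern on a generic candidate list. It is an illustration only: select_diverse, dist2_fn and line_dist2 are hypothetical, candidates are visited in index order (the listing's selection order is determined elsewhere), and the distance is supplied by a callback rather than by fingerprint and field tables.

```c
#include <assert.h>
#include <stddef.h>

/* Squared distance between candidates i and j. */
typedef double (*dist2_fn)(size_t i, size_t j, const void *ctx);

/* Greedy exclusion selection over n candidates. Each selected candidate
   flags every candidate within radius_sq (itself included) as dead.
   dead[] must hold n zero-initialized flags; selected[] receives the
   chosen indices. Returns the number of candidates selected. */
static size_t select_diverse(size_t n, dist2_fn dist2, const void *ctx,
                             double radius_sq, unsigned char *dead,
                             size_t *selected, size_t max_hits)
{
    size_t count = 0, i, j;
    assert(dist2 != NULL && dead != NULL && selected != NULL);
    for (i = 0; i < n && count < max_hits; i++) {
        if (dead[i])
            continue;
        selected[count++] = i;
        for (j = 0; j < n; j++)              /* zap the neighbors */
            if (!dead[j] && dist2(i, j, ctx) <= radius_sq)
                dead[j] = 1;
    }
    return count;
}

/* Example distance: squared difference of 1-D coordinates. */
static double line_dist2(size_t i, size_t j, const void *ctx)
{
    const double *x = (const double *)ctx;
    double d = x[i] - x[j];
    return d * d;
}
```

Because dead candidates are skipped, no two selected candidates can lie within the radius of each other, which is the diversity guarantee described in the file header.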

static int Am_I_Close(i, j, test, currentInput)
int i, j;
double test;
int currentInput ;
{
    int onion, intsc;
    double max, Tanimoto;

    Tanimoto = 1. - sqrt( Distance - test);
    ActuallyCompute( i, j, &onion, &intsc, &max, currentInput);
    return( max >= Tanimoto);
}
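
Am_I_Close combines two distances: Distance holds the squared neighborhood radius, test is the squared field distance already accumulated, and the Tanimoto threshold is whatever budget is left over. For a Tanimoto value T between 0 and 1, the test T >= 1 - sqrt(D^2 - test) is equivalent to (1 - T)^2 + test <= D^2, which avoids the square root. The sketch below uses that rearranged form; is_neighbor is a hypothetical name and the equivalence assumes T does not exceed 1.

```c
#include <assert.h>

/* Decide whether a pair is inside the neighborhood radius.
   radius        : neighborhood radius (0.240 by default in the listing)
   field_dist_sq : squared field distance, already summed
   tanimoto      : fingerprint similarity, assumed to lie in [0, 1]
   The fingerprint distance (1 - tanimoto) and the field distance are
   combined as the two legs of a right triangle. */
static int is_neighbor(double radius, double field_dist_sq, double tanimoto)
{
    double fp_dist = 1.0 - tanimoto;
    assert(radius >= 0.0 && field_dist_sq >= 0.0);
    return fp_dist * fp_dist + field_dist_sq <= radius * radius;
}
```

With the default radius of 0.240, a pair with no field difference needs a Tanimoto similarity of at least 0.76 to count as a neighbor, and any field difference raises that requirement.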

/*
** Abstract : Function zaps products which are missing CTOPS or FP fields.
** Usage :
** Returns : 1 if the data value is missing or zero if the values exist.
**
** Algorithms : None.
** Revision History :
**
** Author             Date       Description
** Fred Soltanshahi   05/21/96   Original version.
**
*/
static int IsItMissingAValue(index1,index2,currentInput)
int index1 ;
int index2 ;
{
    int Y_01_Offset = 0 ;
    int Y_02_Offset = 0 ;
    int k ;

    for ( k = 0 ; k < currentInput ; k++ )
    {
        Y_01_Offset += Y_01_Length[k] ;
        Y_02_Offset += Y_02_Length[k] ;
    }
    if ( ( Y_01[index1 + Y_01_Offset] == NULL ) ||
         ( Y_02[index2 + Y_02_Offset] == NULL ) ||
         ( X_01[index1 + Y_01_Offset] == NULL ) ||
         ( X_02[index2 + Y_02_Offset] == NULL ) )
    {
        return 1 ;
    }
    return 0 ;
}

static int GetNextLine( FILE *filePointer, FILE *fingerfp, int *pCard, int **pFP,
                        unsigned char **pXP, int index
#ifdef OBSOLETE_IS_OK
                        , float **rangeValues, int **oneOfValues
#endif
                        )
{
    char *line, *fpcard, *fp, *CTOPS ;
    int words, hold;
    int pos ;

    if (-1 == UTL_SCAN_GETS( filePointer, "W", &line))
        goto AddTraceback ;
#ifdef OBSOLETE_IS_OK
    ReadLineAttributes(line,
                       NumRangeFields,
                       rangeValues,
                       RangeFields,
                       NumOneOfFields,
                       oneOfValues,
                       OneOfValues) ;
#endif
    /* CTOPS = strstr(line,"CTOPS = ") + strlen("CTOPS-"); */
    CTOPS = strstr(line, "CTOPS = ") ;
    if (!(*pFP = (int *) UTL_MEM_ALLOC( BytesPerFingerPrint)))
        goto AddTraceback ;
    if (!UTL_FILE_FREAD( pCard, sizeof(int), 1, fingerfp))
        goto AddTraceback ;
    if (!UTL_FILE_FREAD( *pFP, sizeof(int), WordsPerFingerprint, fingerfp))
        goto AddTraceback ;
    if ( CTOPS )
    {
        CTOPS += strlen("CTOPS");
UTL_SCAN_TOKENIZE(CTOPS, ' , 'W'); 
10 UTL_SCAN_TOKENIZE(CTOPS, • > ' , 'W'); 

        words = strlen(CTOPS) / 2; /* must have 8 bit bytes */
        if (!XytesPerFingerPrint[index])
        {   XytesPerFingerPrint[index] = words;
        }
        if ( words != XytesPerFingerPrint[index]) goto MissingValue;
        *pXP = (unsigned char *) UTL_MEM_ALLOC(words);
        for (words=0; words<XytesPerFingerPrint[index]; words++)
        {
            memcpy(next2,CTOPS,2);
            CTOPS += 2;
            sscanf(next2, "%2x", &hold);
            *(*pXP + words) = (unsigned char) hold;
        }
    }
    return 1;
MissingValue :
    *pCard = 0 ;
    *pFP = (int *)NULL ;
    *pXP = (unsigned char *)NULL ;
    return 1 ;
AddTraceback :
    return 0 ;
}
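
GetNextLine reads the CTOPS field as a string of hexadecimal digit pairs and stores one byte per pair; each byte then carries two 4-bit field codes that IntersectQuery separates with 0x0F and 0xF0 masks. The following sketch isolates that decoding step. The names decode_hex, hex_value, low_code and high_code are hypothetical, and the sketch validates its input instead of relying on sscanf as the listing does.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode text (pairs of hex digits) into out[0..max-1].
   Returns the number of bytes written, or -1 if the string is malformed
   or does not fit in the output buffer. */
static int decode_hex(const char *text, unsigned char *out, size_t max)
{
    size_t n = 0;
    assert(text != NULL && out != NULL);
    while (text[0] != '\0') {
        int hi = hex_value(text[0]);
        int lo = (text[1] != '\0') ? hex_value(text[1]) : -1;
        if (hi < 0 || lo < 0 || n >= max)
            return -1;
        out[n++] = (unsigned char)(hi * 16 + lo);
        text += 2;
    }
    return (int)n;
}

/* Each decoded byte packs two 4-bit field codes. */
static int low_code(unsigned char b)  { return b & 0x0F; }
static int high_code(unsigned char b) { return (b & 0xF0) >> 4; }
```

Returning -1 for an odd-length or non-hex string corresponds loosely to the MissingValue path in the listing, where a record with an unexpected CTOPS length is treated as having no data.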

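static int not_here( what, nbytes )
unsigned char *what;
int nbytes;
{
    /* complement each byte in place; the pointer is advanced in the loop
       header so that it is not read and modified in one expression */
    for ( ; nbytes; --nbytes, what++) *what = ~*what;
    return 1;
}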

/* this belongs in the utl module, actually */
int MakeComLine( char *line, int len, int argc, char **argv)
{
    int i;

    sprintf(line,"%s ",argv[0]);
    for(i=1;i<argc;i++)
    {
        line += strlen(line);
        sprintf(line,"%s ",argv[i]);
    }
    return 1;
}
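
MakeComLine receives a buffer length but never checks it, so a long command line could overrun the buffer. A bounded variant can be sketched with snprintf as follows; join_args is a hypothetical name and the return convention (1 on success, 0 when the arguments do not fit) is an assumption modeled on the other functions in this listing.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Join argv[0..argc-1] into line, each argument followed by one space.
   Returns 1 on success, 0 if the result would not fit in len bytes
   (in which case line holds a truncated, still terminated string). */
static int join_args(char *line, size_t len, int argc, char **argv)
{
    size_t used = 0;
    int i;
    assert(line != NULL && len > 0);
    line[0] = '\0';
    for (i = 0; i < argc; i++) {
        int n = snprintf(line + used, len - used, "%s ", argv[i]);
        if (n < 0 || (size_t)n >= len - used)
            return 0;               /* output error or truncation */
        used += (size_t)n;
    }
    return 1;
}
```

Tracking the number of bytes already written also avoids the repeated strlen call on a growing string.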

static void UserHitControlC()
/*+
 * This function is the signal handler for user initiated program termination.
 * Its only role is to set a flag indicating that the user wishes to abort the program.
 *
 * Author        Date       Description
 * G. B. Smith   02-09-93   Original Version
 */
{
    UserAborted = 1;
}

static int ReadEverything()
{
    char *hold;
    char buff[255];
    int i;
    int j;
    char *cp ;
    char *mp ;
    int offset ;
    int size ;
    char *input ;
    void *bitset[MAX_INPUT_CSLNS] ;

    /* because failure here means end program run, no effort to clean up
       memory on error is included. */
    offset = 0;
    if ( !MasterFile && !BitsetSource )
    {
        fprintf(stderr,"An input file (master or bitset) must be specified\n");
        return 0 ;
    }
    if ( MasterFile && BitsetSource )
    {
        fprintf(stderr,"A design run can be run from either a master or a bitset file\n");
        return 0 ;
    }
    if ( MasterFile && !MasterRecord )
    {
        fprintf(stderr,"A Bitset (or Master) record number must be specified\n");
        return 0 ;
    }
    if (!WarmUp() || !WhatsTheDifference()) return 0;
    /*
    ** Special case where we want to process all the records in the
    ** master file.
    */
    if ( atoi(MasterRecord) == -1 )
    {
        if ( ( TotalInputs = CountMasterRecords(MasterFile)) == 0 )
            goto UnableToReadMaster ;
        for ( i = 0 ; i < TotalInputs ; i++ )
        {
            InputNames[i] = UTL_STR_SAVE(MasterFile);
            InputStartRec[i] = i+1 ;
        }
    }
    else
    {
        if ( MasterFile )
            input = UTL_STR_SAVE(MasterFile);
        else
            input = UTL_STR_SAVE(BitsetSource);
        /*
        ** If there are more than one input file, process them all.
        */
        cp = strtok(input," ");
        while ( cp )
        {
            InputNames[TotalInputs++] = UTL_STR_SAVE(cp);
            cp = strtok(NULL," ");
        }
        mp = strtok(MasterRecord," ");
        for ( i = 0 ; i < TotalInputs ; i++ )
        {
            /*
            ** If the user specified record numbers for all the master files, then use them
            ** otherwise we will use the first record.
            */
            if ( mp )
            {
                InputStartRec[i] = atoi(mp);
                mp = strtok(NULL," ");
            }
            else
                InputStartRec[i] = 1 ;
        }
    }




    mp = strtok(CheckPointFileName," ");
    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        if ( mp )
        {
            OutputCheckpointNames[i] = UTL_STR_SAVE(mp);
            mp = strtok(NULL," ");
        }
        else
        {
            sprintf(buff, "%s_%d_chk.bs", basename(InputNames[i],NULL), i);
            OutputCheckpointNames[i] = UTL_STR_SAVE(buff);
        }
    }
    nY_01 = nY_02 = 0 ;
    if (TotalInputs > 1)
        fprintf(stderr,"All files assumed to be for same core\n");
    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        if ( MasterFile )
        {
            if ( !RetrieveMasterFile(InputNames[i],
                                     MasterFile_File,
                                     InputStartRec[i],
                                     &(NumMissingBits[i]),
                                     &(BitsInAbsentiaNoCount[i]),
                                     &(CoreFileNames[i]),
                                     &(CoreStart[i]),
                                     &FngrFile,
                                     &(X1file[i]),
                                     &(X2file[i]),
                                     &(Y_01_Length[i]),
                                     &(Y_02_Length[i]),
                                     &(fingerFP[i]),
                                     &(fingerOffsets[i]),
                                     &ScreenFileName,
                                     &BytesPerFingerPrint,
                                     &WordsPerFingerprint,
                                     &query,
                                     &FingerCore_FP,
                                     &FingerCore_Card) )
                goto UnableToReadMaster ;
            RemainingInput[i] = Y_01_Length[i] * Y_02_Length[i];
        }
        else
        {
            if ( !( bitset[i] = CS_PRDCT_BITSET_OPEN(InputNames[i],
                                                     InputStartRec[i])) )
                goto UnableToReadBitset ;
            if ( !RetrieveMasterFileFromBitset(bitset[i],
                                               &(MasterFile_Bitset[i]),
                                               &(StartRec_Bitset[i]),
                                               &(NumMissingBits[i]),
                                               &(BitsInAbsentiaNoCount[i]),
                                               &(CoreFileNames[i]),
                                               &(CoreStart[i]),
                                               &FngrFile,
                                               &(X1file[i]),
                                               &(X2file[i]),
                                               &(Y_01_Length[i]),
                                               &(Y_02_Length[i]),
                                               &(fingerFP[i]),
                                               &(fingerOffsets[i]),
                                               &ScreenFileName,
                                               &BytesPerFingerPrint,
                                               &WordsPerFingerprint,
                                               &query,
                                               &FingerCore_FP,
                                               &FingerCore_Card) )
                goto UnableToReadBitset ;
            RemainingInput[i] = CS_PRDCT_BITSET_SELECTED(bitset[i]);
        }
        nY_01 += Y_01_Length[i] ;
        nY_02 += Y_02_Length[i] ;
        RetrieveSymmetry(CoreFileNames[i],CoreStart[i],&(CoreSym[i]) );
    }

    if (! (Y_01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(cY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(iY_01 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (! (X_01 = (unsigned char **)
                  UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(iX_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01)))
        goto UnableToAllocateMemory ;
    if (!(jX_01 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_01)))
        goto UnableToAllocateMemory ;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if (!(RangeValues_Y01 = (float **) UTL_MEM_ALLOC(sizeof(float *) * nY_01)))
            goto UnableToAllocateMemory ;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y01 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_01)))
            goto UnableToAllocateMemory ;
    }
#endif
    /*
    ** Read all the values for the X1 file.
    */
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        if ( !(fileHandles[j] = fopen(X1file[j],"r")) )
            goto UnableToOpenX1File ;
        for (i=0;i<Y_01_Length[j];i++)
        {
            if (! GetNextLine( fileHandles[j],
                               fingerFP[j],
                               cY_01 + i + offset ,
                               Y_01 + i + offset ,
                               X_01 + i + offset ,
                               0
#ifdef OBSOLETE_IS_OK
                               ,RangeValues_Y01 + i + offset ,
                               OneOfValues_Y01 + i + offset
#endif
                               )) return 0;
        }
        offset += Y_01_Length[j] ;
        fclose(fileHandles[j]);
    }

    if (! (Y_02 = (int **) UTL_MEM_ALLOC(sizeof(int *) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(cY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(iY_02 = (int *) UTL_MEM_ALLOC(sizeof(int ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (! (X_02 = (unsigned char **)
                  UTL_MEM_ALLOC(sizeof(unsigned char *) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(iX_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02)))
        goto UnableToAllocateMemory ;
    if (!(jX_02 = (double *) UTL_MEM_ALLOC(sizeof(double ) * nY_02)))
        goto UnableToAllocateMemory ;
#ifdef OBSOLETE_IS_OK
    if ( NumRangeFields )
    {
        if (!(RangeValues_Y02 = (float **) UTL_MEM_ALLOC(sizeof(float *) *
                                                         nY_02)))
            goto UnableToAllocateMemory ;
    }
    if ( NumOneOfFields )
    {
        if (!(OneOfValues_Y02 = (int **) UTL_MEM_ALLOC(sizeof(int *) *
                                                       nY_02)))
            goto UnableToAllocateMemory ;
    }
#endif

    offset = 0 ;
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        if ( !(fileHandles[j] = fopen(X2file[j],"r")) )
            goto UnableToAllocateMemory ;
        for (i=0;i<Y_02_Length[j];i++)
        {
            if (! GetNextLine( fileHandles[j],
                               fingerFP[j],
                               cY_02 + i + offset ,
                               Y_02 + i + offset ,
                               X_02 + i + offset ,
                               1
#ifdef OBSOLETE_IS_OK
                               ,RangeValues_Y02 + i + offset ,
                               OneOfValues_Y02 + i + offset
#endif
                               )) return 0;
        }
        offset += Y_02_Length[j] ;
        fclose(fileHandles[j]) ;
    }
    if (!Good_1) /* note: Good_1 is never used but triggers other allocations */
    {
        i = (nY_01+31)/32 * 4;
        if (!(Good_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_1,0,i);
        if (!(Dead_1 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_1,0,i);
        i = (nY_02+31)/32 * 4;
        if (!(Good_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Good_2,0,i);
        if (!(Dead_2 = (int *) UTL_MEM_ALLOC(i))) return 0; memset( Dead_2,0,i);
        for ( size = 0 , j = 0 ; j < TotalInputs ; j++ )
        {   BitOffsets[j] = size;
            size += ( Y_01_Length[j] * Y_02_Length[j] ) ;
        }
        Pro_size = size = ( size + 31 )/32 * 4 ;
        if (!(Good_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Good_Products,0,size);
        if (!(Dead_Products = (int *) UTL_MEM_ALLOC(size))) return 0;
        memset( Dead_Products,0,size);
        if ( !MasterFile ) /* gather the dead together.... */
            Mortuary(bitset, TotalInputs, Dead_Products, size, BitOffsets);
    }

    offset = SomeLeft = 0 ;
    /*
    ** Figure out the number of products for each input set and the total
    ** number of products.
    */
    for ( j = 0 ; j < TotalInputs ; j++ )
    {
        BitMapStartPoint[j] = offset ;
        offset += Y_01_Length[j] * Y_02_Length[j] ;
        SomeLeft += RemainingInput[j];
    }
    TotalProducts = offset ;
#ifdef OBSOLETE_IS_OK
    /*
    ** Initialize the needed structures to pass around.
    */
    RangeValuesData.numRangeFields = NumRangeFields ;
    RangeValuesData.rangeValues_Y01 = RangeValues_Y01 ;
    RangeValuesData.rangeValues_Y02 = RangeValues_Y02 ;
    RangeValuesData.rangeFields = RangeFields ;
    OneOfValuesData.numOneOfFields = NumOneOfFields ;
    OneOfValuesData.oneOfValues_Y01 = OneOfValues_Y01 ;
    OneOfValuesData.oneOfValues_Y02 = OneOfValues_Y02 ;
    OneOfValuesData.oneOfFields = OneOfValues ;
#endif
    InputData.totalInputs = TotalInputs ;
    InputData.Y_01_Length = Y_01_Length ;
    InputData.Y_02_Length = Y_02_Length ;
#if 0
    /*
    ** Read in the -rangevar values if they are present in the csln file.
    */
    if( !ReadRangeVarFromCoreFile( Corefile,
                                   NumRangeFields,
                                   RangeFields) )
        return 0 ;
#endif
    return 1;
UnableToOpenX1File :
    fprintf(stderr,"Unable to open reagent file\n");
    goto AddTraceback ;
UnableToAllocateMemory :
    fprintf(stderr,"Unable to allocate memory\n");
    goto AddTraceback ;
UnableToReadBitset :
    fprintf(stderr,"Unable to Read bitset file\n");
    goto AddTraceback ;
UnableToReadMaster :
    fprintf(stderr,"Unable to Read master file\n");
    goto AddTraceback ;
AddTraceback :
    return 0 ;
}

int RetrieveSymmetry( char *FileName, int Start, int *pSym )
{
    FILE *tmp;
    char *line;

    if (!(tmp = fopen(FileName, "r"))) return 0;
    for ( ; Start; Start--)
        if (-1 == UTL_SCAN_GETS( tmp, "W", &line)) return 0 ;
    fclose(tmp);
    if (strstr(line,"SYM=1")) *pSym = 1; else *pSym = 0;
    return 1;
}

/* concatenate a series of compressed bitsets into one big raw bitset
   -> AND <- destroy those compressed bitsets */
int Mortuary(void *bitset[], int nsets, int *rawbits, int byte_size, int *offset)
{
    int i ;

    for (i=0; i<nsets; i++)
    {   CS_PRDCT_BITSET_CONCAT_RAW( bitset[i], rawbits, offset[i], 0);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(bitset[i]);
        bitset[i] = NULL;
    }
    not_here( rawbits, byte_size );
}
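
The product bitsets in this listing are sized as (n + 31)/32 * 4 bytes, indexed by start + i * n2 + j for the product of substituent i and substituent j, and complemented in place by not_here once the compressed bitsets have been merged. The sketch below collects those operations in one place. All names are hypothetical, and the bit order within a byte is an assumption, since the listing's TestBit and FlagProduct routines are defined elsewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for nbits flags, rounded up to whole 32-bit words. */
static size_t bitset_bytes(size_t nbits)
{
    return (nbits + 31) / 32 * 4;
}

/* Flat bit position of product (i, j) in an input set that starts at
   position start and has n2 second-site substituents. */
static size_t product_index(size_t start, size_t i, size_t j, size_t n2)
{
    return start + i * n2 + j;
}

/* Set one bit (least significant bit first within each byte). */
static void set_bit(unsigned char *bits, size_t pos)
{
    bits[pos / 8] |= (unsigned char)(1u << (pos % 8));
}

/* Read one bit; returns 0 or 1. */
static int test_bit(const unsigned char *bits, size_t pos)
{
    return (bits[pos / 8] >> (pos % 8)) & 1;
}

/* Complement every byte in place. */
static void invert_bits(unsigned char *bits, size_t nbytes)
{
    size_t k;
    assert(bits != NULL);
    for (k = 0; k < nbytes; k++)
        bits[k] = (unsigned char)~bits[k];
}
```

The final inversion appears to turn a bitset of still-selectable products into a bitset of dead ones, which matches the "not here" naming in the listing.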



static int ParseArguments( argc, argv )
/*+
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 * Warnings:
 * Errors:
 * See Also:
 *
 * Author        Date       Description
 * G. B. Smith   02-09-93   Original Version
 */
int argc;
char **argv;
{
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    OutputFile = stdout;
    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( ! nargs ) goto SyntaxError;
    if (WhatFirst)
    {   if (strstr(WhatFirst,"R1")) WhatFirst[0] = '1' ;
        if (strstr(WhatFirst,"R2")) WhatFirst[0] = '2' ;
    } else {
        WhatFirst = UTL_MEM_ALLOC(2); WhatFirst[0] = '0'; }
#ifdef OBSOLETE_IS_OK 
if ( RangeVar && ! 

30 ParseRangeVar(RangeVar,&NumRangeFieldsAllocated,&NumRangeFields,^ 

goto SyntaxError ; 
    if ( OneOfVar &&
         !ParseOneOfVar(OneOfVar,&NumOneOfFieldsAllocated,&NumOneOfFields,&OneOfValues))
        goto SyntaxError ;
#endif
    return 1;
SyntaxError:
    return 0;
}

static int OpenOutputFile()
/*+
 *
 * Returns: 1 on success, else 0
 */
{
    char *msg;
    FILE *fp;

    OutputFile = stdout;
    if( OutputFileName )
    {
        /*
        ** We need to create output files under the ownership of the REAL user not the
        ** EFFECTIVE user. This only applies if setuid options are activated.
        */
        {
            struct stat statBuff ;
            int uid ;
            int euid ;

            uid = getuid() ;
            euid = geteuid();
            stat(OutputFileName, &statBuff);
            /*
            ** There are two cases
            ** (1) the file to output to exists
            **     Use the ownership of the current owner of the file or if you can't do that
            **     do not do anything.
            ** (2) The file is being created.
            **     use the ownership of the REAL user.
            */
            if ( access(OutputFileName, F_OK) == 0 )
            { /* If the file exists and the real user is the owner of the file */
                if ( statBuff.st_uid == uid )
                    seteuid(uid);
            }
            else
            { /* Create the file as the REAL user */
                seteuid(uid);
            }
        }
        OutputFile = fopen( OutputFileName, "wb");
        if( !OutputFile ) {
            fprintf(stderr, "Error: Failed to open output file \"%s\"\n",
                    OutputFileName );
            goto ErrorReturn;
        }
    }
    return 1;
ErrorReturn:
    return 0;
}



static CloseOutputFile()
/*+
 * This function closes the output file. It is included just for cleanliness.
 *
 * Author        Date       Description
 * G. B. Smith   02-09-93   Original Version
 */
{
    fclose( OutputFile );
}

CheckPointProgram(programName)
char *programName ;
{
    int sizes[2] ;
    int allocSizes[2] ;
    int numInSites[2] ;
    char hold[81] ;
    int i ;
    void *compressed ;
    int total ;

    for ( i = 0 ; i < TotalInputs ; i++ )
    {
        sizes[0] = Y_01_Length[i] ;
        sizes[1] = Y_02_Length[i] ;
        numInSites[0] = numInSites[1] = -1 ;
        allocSizes[0] = allocSizes[1] = -1 ;
        /*
        ** Lets get a compressed version of the dead products before we write it out
        ** to file.
        */
        compressed = CS_PRDCT_BITSET_CREATE_BIT_STRING(
                         Dead_Products,
                         BitMapStartPoint[i],
                         2,
                         sizes,
                         sizes,
                         &total);
        WriteOutCheckPointFile(OutputCheckpointNames[i],
                               MasterFile ? InputNames[i]
                                          : MasterFile_Bitset[i],
                               MasterFile ? InputStartRec[i]
                                          : StartRec_Bitset[i],
                               programName,
                               Good_Products,
                               BitMapStartPoint[i],
                               2,
                               sizes,
                               allocSizes,
                               Selections[i],
                               numInSites,
                               total,
                               compressed);
        CS_PRDCT_BITSET_DESTROY_BIT_STRING(compressed);
    }
}

int main( argc, argv )
/*+
 */
int argc;
char **argv;
{
    long startTime,
         totalTime,
         finishTime;
    int numFiltered ;
    int numEliminated ;
    int tmp ;
    char comline[2048];

    /*
    *** Establish handler for a user interrupt.
    */
    signal( SIGINT, UserHitControlC);
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC);
#endif
    /*
    ** Initialize variables.
    */
    Distance = 0.240;
    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if( !OpenOutputFile() ) goto FailureExit;
    /* if (!RestartState()) goto FailureExit; */
    time( &startTime );
    Visual((stderr, "Begin reading files: %s",ctime(&startTime)));
    /* Let's actually do something now */
    if (!ReadEverything(NoMorehitsPlease)) goto FailureExit;
    time( &finishTime );
    Visual((stderr, "Begin filtering: %s",ctime(&finishTime)));
    if (!FilterProducts(&InputData,&RangeValuesData,&OneOfValuesData,&numFiltered,IsItMissingAValue))
        goto FailureExit;
    CurrentInput = 0 ;
    time( &finishTime );
    Visual((stderr,"Filtered out %d out of %d possible products\n",numFiltered,
            TotalProducts ));




#if 0
    /*
    ** Now see if there are any hitlists or databases that you should filter
    ** for.
    */
    time( &finishTime );
    Visual((stderr, "Begin eliminating selections in Unity database %s",ctime(&finishTime)));
    if ( !EliminateProductsFromDatabase(DatabaseNames,
                                        &numEliminated,
                                        ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    time( &finishTime );
    Visual((stderr, "Begin eliminating selections in Unity hitlist %s",ctime(&finishTime)));
    if ( !EliminateProductsFromHitlist(HitlistNames,
                                       ScreenFileName?ScreenFileName:DefaultScreenFileName,
                                       &tmp,
                                       ZapAllNeighbors))
        goto FailureExit;
    time( &finishTime );
    Visual((stderr,"Eliminated %d out of %d possible products\n",numEliminated + tmp,
            TotalProducts ));
#endif

    Visual((stderr, "Begin selection: %s",ctime(&finishTime)));
    if (!UserAborted &&
        !SelectEverything(InputSource,
                          NoMorehitsPlease,
                          WhatFirst,
                          CalcualteProductFingurePrint,
                          ActuallyCompute,
                          ZapAllNeighbors)) goto FailureExit;
    CloseOutputFile();
    time( &finishTime );
    totalTime = finishTime - startTime;
    if( !totalTime ) totalTime = 1;
    Visual((stderr, "Created %d Selections in ", nProcessed ));
    Visual((stderr,"%d Hours, %d min, %d secs\n",
            totalTime/(60*60),
            (totalTime%(60*60))/60,
            (totalTime%60)));
    Visual((stderr,"Each comparison required %.8f seconds to calculate\n",
            (totalTime/((double)(nProcessed?nProcessed:1)))));
    Visual((stderr,"End Quick Select Computation: %s",ctime(&finishTime)));
    MakeComLine(comline, 2048, argc, argv);
    CheckPointProgram(comline);
    UserAborted ? exit(ErrorExit) : exit(GoodExit);
SyntaxError:
    exit(1);
FailureExit:
    exit(ErrorExit);
}
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A ppendix "M" 



*l 

I* listbitset 
*/ 

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "hits.h"
#include "hits_proto.h"
#include "commonData.h" /* Globals used by most functions, we will clean this
                           up soon */
#include "dbcsln_bs_proto.h"
#include "dbcsln_hlm_proto.h"
extern char *basename() ;
extern char *DB_CT_CCT_FIX_SLN();
typedef enum
{
    HeaderOnly,
    FullListing,
    DetailListing,
    Hitlist
} ListingOptions ;

static char *OptionsNames[] = { "header", "full", "detail", "str_list" } ;
static ListingOptions listOption = HeaderOnly ;
static char *HitlistFile ;
static char *BitsetFile ;
static int UserAborted;
static char *CombNameTemplate = (char *)NULL ;
static int CombCounter;
char *HitlistName = "listbitset.hits" ;
FILE *HitFile ;
static char *Prefix ;
static struct ParseOptions Options[] =
{
/***
    DO NOT MOVE ENTRIES IN THIS TABLE. ADD ENTRIES ONLY AT THE END.
***/
    {"hitlist", ParseOptString, &HitlistFile,
        "Name is the file with hitlist records, ie. xxxxx.hits file" },
    {"bitset", ParseOptString, &BitsetFile,
        "Name is the bitset file . ie. xxxxx.csr file" },
    {"list", ParseOptEnum, (void *)OptionsNames ,
        "Type of output" },
    {"output", ParseOptString, &HitlistName,
        "Name is the file with hitlist records, ie. xxxxx.hits file" },
    {"prefix", ParseOptString, &Prefix,
        "Prefix for naming the products. Product name will be Prefix_Y01_Y02_n" },
};


static void UserHitControlC()
/*+I
 * This function is the signal handler for user initiated program
 * termination.
 *
 * Its only role is to set a flag indicating that the user wishes to abort
 * the program.
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 */
{
    UserAborted = 1;
}

static int ParseArguments( argc, argv )
/*+I
 * This function parses the command line arguments.
 *
 * Returns: 1 on a successful command line parse, 0 otherwise.
 *
 * Warnings:
 *
 * Errors:
 *
 * See Also:
 *
 * Author        Date      Description
 * G. B. Smith   02-09-93  Original Version
 */
int argc;
char **argv;
{
    char *fileTypeName;
    int nargs,
        noptions = sizeof( Options )/sizeof(Options[0]);

    nargs = UTL_PARSE_OPT( argc, argv, noptions, Options );
    if( !nargs ) goto SyntaxError;
    fileTypeName = *((char **)Options[2].value);
    if( !strcmp( "detail", fileTypeName ))
        listOption = DetailListing;
    else if( !strcmp( "full", fileTypeName ))
        listOption = FullListing;
    else if( !strcmp( "str_list", fileTypeName ))
        listOption = Hitlist;
    else
        listOption = HeaderOnly;
    return 1;
SyntaxError:
    return 0;
}

int CallbackFunc(ct, numAttachments, indexes)
struct CtConnectionTable *ct;
int numAttachments;
int indexes[];
{
    static char *sln = 0;
    static char nameBuffer[41];

    CombCounter += 1;

    if ( CombNameTemplate && *CombNameTemplate )
    {
        if ( !DB_CT_CT_ATTR_EXISTS( ct, CtCtName ) &&
             !DB_CT_CT_ATTR_EXISTS( ct, CtCtRegId ))
        {
            (void)sprintf( nameBuffer, "%.30s_%d_%d_%0d",
                           CombNameTemplate,
                           indexes[0],
                           indexes[1],
                           CombCounter);
            if (!DB_CT_SET_CT_ATTR( ct, CtCtName, nameBuffer ))
                goto trc;
        }
    }

    if ( sln = DB_CT_SLN_GENERATE( ct ))
    {
        fprintf( (FILE*)HitFile, "%s\n", sln );
        UTL_MEM_FREE( sln );
    }
    else
    {
        if ( UTL_ERROR_IS_SET())
            goto trc;
    }

    return 1;

trc: UTL_ERROR_ADD_TRACE( "CallbackFunc" );
    return 0;
}

static struct HitsHitList *CreateHitlist( hName, cSln, core )
char *hName;
char *cSln;
char *core;
{
    struct HitsHitList *hitlist;
    char *bname;
    int len;

    hitlist = DB_HITS_CREATE( "STRLIST" );
    if ( !hitlist ) goto trc;

    if ( !hName || !*hName ) hName = "SCRATCH";
    bname = (char *) basename( hName, (char*)0 );

    DB_HITS_SET_ATTR( hitlist, HitsAttrName, bname );

    if ( Prefix )
    {
        CombNameTemplate = UTL_STR_SAVE(Prefix);
    }
    else
    {
        if ( !CombNameTemplate && bname )
        {
            CombNameTemplate = bname;
            for (len = 0, bname = CombNameTemplate + 1; *bname; ++bname, ++len )
            {
                if ( !isalnum( *bname ) || len == 36 )
                {
                    *bname = 0;
                    break;
                }
            }
        } else {
            if (bname) UTL_MEM_FREE(bname);
        }
    }

    DB_HITS_SET_ATTR( hitlist, HitsAttrFilename, hName );
    DB_HITS_SET_ATTR( hitlist, HitsAttrDatabase, "NONE" );
    DB_HITS_SET_ATTR( hitlist, HitsAttrSource, "SLN_EXPLODER" );

    if ( cSln && *cSln )
        DB_HITS_SET_ATTR( hitlist, HitsAttrQuery, cSln );

    if ( core && *core )
        DB_HITS_SET_ATTR( hitlist, HitsAttrCore, core );

    DB_HITS_SYNC_FILE( hitlist, hName, (void*)0 );

    return hitlist;

trc: UTL_ERROR_ADD_TRACE( "CreateHitlist" );
    return 0;
}

DumpBitsetInfo(char *bitsetName, int bIndex, char *core)
{
    void *bitset;
    int numProducts;
    int *sizes = (int *)NULL;
    int *numUsed = (int *)NULL;
    int i;
    char *masterName;
    int masterRec;
    char *coreInfo;
    char *xrString;
    int numSites;
    char **xNames;
    char *bName;
    char *newCore;
    char *cp;
    char buffer[1024];
    int **productIndexes;
    void *c;
    struct HitsHitList *h, *CreateHitlist();
    char *fixedCore;

    if ( !(bitset = CS_PRDCT_BITSET_OPEN(bitsetName, bIndex)))
    {
        fprintf(stderr, "Unable to open %s %d\n", bitsetName, bIndex);
        return 0;
    }

    if ( !CS_PRDCT_BITSET_GET_STATS(bitset,
                                    &numSites,
                                    &numProducts,
                                    &sizes,
                                    &numUsed))
    {
        fprintf(stderr, "Unable to get stat on %s %d\n", bitsetName, bIndex);
        return 0;
    }

    if ( numProducts == -1 )
        numProducts = 0;
    if ( !core )
    {
        CS_PRDCT_BITSET_CORE_INFO(bitset,
                                  &masterName,
                                  &masterRec,
                                  &coreInfo,
                                  &xrString,
                                  &numSites,
                                  &xNames);
        newCore = Replace_Y_0x_With_Xx(coreInfo);
    }
    else
    {
        newCore = Replace_Y_0x_With_Xx(core);
    }

    if ( listOption == Hitlist )
    {
        fixedCore = DB_CT_CCT_FIX_SLN(newCore, 1);
        if (!(h = CreateHitlist(HitlistName, "query_sln", "core.sln")))
            return 0;

        if (!(HitFile = fopen(HitlistName, "a")))
            return 0;

        /*
        ** Allocate the arrays.
        */
        productIndexes = (int **)UTL_MEM_CALLOC(numSites, sizeof(int *));
        for ( i = 0; i < numSites; i++ )
            productIndexes[i] = (int *)UTL_MEM_CALLOC(numProducts, sizeof(int));

        /*
        ** Figure out what indexes compose a product.
        */
        CS_PRDCT_BITSET_GET_HITS(bitset, productIndexes);

        /*
        productIndexes[0][nProcessed] = Y_01 ;
        productIndexes[1][nProcessed] = Y_02 ;
        */
        c = (void *) DB_CT_CCT_GET_PRD_INIT(fixedCore,
                                            xrString,
                                            numSites,
                                            xNames);
        if ( !c )
        {
            fprintf(stderr, "\nUnable to init");
            return -1;
        }

        DB_CT_CCT_GET_PRD_PRODUCT(c,
                                  numProducts,
                                  productIndexes,
                                  CallbackFunc);

        DB_CT_CCT_GET_PRD_CLEANUP(c);

        DB_HITS_CLOSE( h );
        if ( CombNameTemplate )
            UTL_MEM_FREE( CombNameTemplate ),
            CombNameTemplate = 0;
        fclose(HitFile);
    }
    else
    {
        fprintf(stdout, " %s %d\n", bitsetName, bIndex);
        cp = strtok(newCore, "<");
        bName = basename(bitsetName, NULL);
        if ( cp )
            sprintf(buffer,
                    "%s<CS_PRD_BITSET_FILE=\"%s\";CS_PRD_BITSET_OFFSET=\"%d\">",
                    cp, bName, bIndex);
        else
            sprintf(buffer,
                    "%s<CS_PRD_BITSET_FILE=\"%s\";CS_PRD_BITSET_OFFSET=\"%d\">",
                    newCore, bName, bIndex);
        fprintf(stdout, " %s\n", buffer);
        fprintf(stdout, "Num Products : %d of %d\n", numProducts, sizes[0]*sizes[1]);
        for ( i = 0; i < numSites; i++ )
            fprintf(stdout, "Num Y_0%d : %d of %d\n", i+1, numUsed[i], sizes[i]);
        if ( ( listOption == FullListing ) || ( listOption == DetailListing ) )
            CS_PRDCT_BITSET_DUMP(bitset);

        return 1;
    }
}



CheckPointProgram()
{
    fprintf(stderr,
            "CheckPointProgram() is a lonely stub in listbitset.c\n");
}



int main( argc, argv )
/*+E
*/
int argc;
char **argv;
{
    int totalHits;
    int i;
    int j;
    int hId;
    int bIndex;
    char bitsetName[1024];
    char hold[81];
    char *cp;
    char *pRet;
    char *dir;
    char *fullPath;
    int n;
    char *core;

    /***
    *** Establish handler for a user interrupt.
    ***/
    signal( SIGINT, UserHitControlC );
#ifdef SIGHUP
    signal( SIGHUP, UserHitControlC );
#endif

    if( !ParseArguments( argc, argv ) )
        goto SyntaxError;
    if ( !HitlistFile && !BitsetFile )
    {
        fprintf(stderr, "An input (bitset or hitlist) file is required\n");
        goto SyntaxError;
    }

    if ( HitlistFile )
    {
        if ( !(hId = CS_HLM_OPEN_HITLIST(HitlistFile)) )
            goto UnableToOpenHitlist;
        dir = dirname(HitlistFile);

        totalHits = CS_HLM_GET_HITS_TOTAL(hId);
        for ( i = 0; i < totalHits; i++ )
        {
            pRet = CS_HLM_GET_HITS(hId, i, 1);

            /*
            ** Grab the bitset file name and the offset from the csln.
            */
            cp = strstr(pRet, "CS_PRD_BITSET_FILE=");
            if ( !cp )
                goto InvalidCsln;
            cp += 20;
            j = 0;
            while ( *cp != '"' )
                bitsetName[j++] = *cp++;
            bitsetName[j] = 0;

            cp = strstr(pRet, "CS_PRD_BITSET_OFFSET=");
            if ( !cp )
                goto InvalidCsln;
            cp += 21;
            j = 0;
            while ( *cp != ';' )
                hold[j++] = *cp++;
            hold[j] = 0;
            bIndex = atoi(hold);
            if ( dir )
                fullPath =
                    UTL_FILE_ADD_DIR_TO_DIRSPEC(dir, bitsetName);
            else
                fullPath = bitsetName;
            core = strtok(pRet, "\n");
            core = strtok(NULL, "\n");
            DumpBitsetInfo(fullPath, bIndex, core);
        }
    }
    else
    {
        n = CountBitSets(BitsetFile);
        for ( i = 0; i < n; i++ )
            DumpBitsetInfo(BitsetFile, n, NULL);
    }

    UserAborted ? exit(ErrorExit) : exit(GoodExit);

SyntaxError:
    exit(1);
FailureExit:
UnableToOpenHitlist:
InvalidCsln:
    exit(ErrorExit);
}
Appendix "N"
rODATA

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "dservTypes.h"

#if 0
int NumRangeFields;
int NumRangeFieldsAllocated;
RangeStruct *RangeFields;
int NumOneOfFieldsAllocated;
int NumOneOfFields;
OneOfStruct *OneOfValues;

float **RangeValues_Y01; /* Actual values read in from nnn.X1 file, if MW is the first
    and logp is the second value specified on the -rangevar argument list then
    RangeValues_Y01[n][0] would keep the value for MW for the nth line in the nnn.X1 file
    and RangeValues_Y01[n][1] would keep the value for logp for that line */

float **RangeValues_Y02; /* same */

int **OneOfValues_Y01; /* Actual values read from nnn.X1 files but translated into an
    index of OneOfValues[i].values so we don't have to waste memory and time doing
    strcmp */
int **OneOfValues_Y02; /* Same */
#endif



int TotalInputs = 0;
int CurrentInput = 0;
FILE *fileHandles[255];
char *InputNames[255];
int Y_01_Length[255];
int Y_02_Length[255];
int BitMapStartPoint[255];
int RemainingInput[255];

FILE *OutputFile;
FILE *InputSourceFile;



/* Code presumes that an int is 32 bits, ASCII-ed into %.8x format */
int **Y_01;                /* fingerprints */
int **Y_02;                /* " */
int *query;
int nY_01;                 /* number of structures */
int nY_02;                 /* " */
int *cY_01;                /* cardinality of fingerprints */
int *cY_02;                /* " */
int c_query;               /* " */
int *iY_01;                /* intersection count of fprints */
int *iY_02;                /* " */
unsigned char **X_01;      /* topomers */
unsigned char **X_02;      /* " */
double *iX_01;             /* distance of topomers to selection */
double *iX_02;             /* " */
int *Good_1;
int *Good_2;
int *Dead_1;
int *Dead_2;
int *Good_Products;
int *Dead_Products;
int nbits[256];
int BigBits[65536];
int setbits[8];
double boundary[16];
double Dist[16][16];
double Distance = 80.0;
int BytesPerFingerPrint[2];

int nProcessed = 0;
int SomeLeft;
char next8[10] = "01234567\0";
char next2[10] = "01\0";

char *ScreenFileName;
char DefaultScreenFileName[32] = "$TA_MOLTABLES/standard.2DRULES";

long waste_time, trash_time, inTestBit, inActually;
Appendix "O"
DB UTL

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "commonData.h" /* Globals used by most functions, we will clean this
                           up soon */

static int What1 = -1;
static int What2 = 0;
static int (*CalcFingerPrintFunc)();
static int (*ActuallyComputeFunc)();
extern FILE *debugFile;



int CountLines(fp)
FILE *fp;
{
    int i;
    char *foo;

    i = 0;
    while ( -1 != UTL_SCAN_GETS( fp, "W", &foo)) i++;
    rewind(fp);
    return i;
}

int SelectEverything(inputSource, maxHits, whatFirst, calcFP, computeFP, zapNeighbors)
char *inputSource;
int maxHits;
char *whatFirst;
int (*calcFP)();
int (*computeFP)();
int (*zapNeighbors)();
{
    int cqt, q_lo, q_hi, i, j, carhold, inthold, onion, intsc;
    double max;
    int k;
    int Y_01_Offset, Y_02_Offset;
    int pos;
    int numZapped = 0;

    CalcFingerPrintFunc = calcFP;
    ActuallyComputeFunc = computeFP;

    while (nProcessed < maxHits && ( SomeLeft > 0 ) )
    {
        /*
        ** What we would like to do is first select any selections that were found
        ** in a previous run.
        */
        if ( !inputSource || !( c_query = SelectFromInputFile(inputSource, query)) )
        {
            if (!(c_query = SelectIt(query, whatFirst) )) return 0;
            nProcessed++;
            SomeLeft--;
            RemainingInput[CurrentInput]--;
        }
        /* then zap its neighbors and continue! */
        (*zapNeighbors)(NULL, 0, &numZapped, 1, What1, What2);
    } /* while still stuff left */
    return 1;
}

int TestBit(bitset, bit)
int *bitset, bit;
{
    int what, this;
    unsigned char *bytes;

    bytes = (unsigned char *) bitset;
    what = bit % 8;
    this = bit / 8;
    return (bytes[this] & setbits[what] );
}
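
TestBit treats the int array as a flat run of bytes and picks one bit with a mask taken from setbits[]. The contents of setbits[] are not shown in this listing, so the sketch below assumes setbits[k] == 1 << k (least significant bit first), which agrees with the way GrabRandom walks a byte by shifting right. The names test_bit_sketch, set_bit_sketch and setbits_sketch are illustrative only and do not appear in the original source.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed mask table: entry k selects bit k of a byte, LSB first. */
static const unsigned char setbits_sketch[8] = {1, 2, 4, 8, 16, 32, 64, 128};

/* Return nonzero when bit number `bit` is set in the byte-addressed bitmap. */
int test_bit_sketch(const unsigned char *bytes, size_t bit)
{
    assert(bytes != NULL);
    return bytes[bit / 8] & setbits_sketch[bit % 8];
}

/* Set bit number `bit`, the operation FlagProduct and FlagReagent perform. */
void set_bit_sketch(unsigned char *bytes, size_t bit)
{
    assert(bytes != NULL);
    bytes[bit / 8] |= setbits_sketch[bit % 8];
}
```

Dividing by 8 selects the byte and the remainder selects the bit inside it, so one bit per product costs numY01*numY02/8 bytes per input.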
int ZapInputProduct(inputIndex, whichProduct, index1, index2)
int inputIndex;   /* which input are we currently processing */
int whichProduct; /* This will be a direct index into the bitmap vector */
int index1;       /* if whichProduct is not given we will calculate it */
int index2;       /* from these two values */
{
    if ( !whichProduct )
        whichProduct = index1 * Y_02_Length[inputIndex] + index2;
    whichProduct += BitMapStartPoint[inputIndex];
    FlagProduct(Dead_Products, index1, index2, whichProduct );
    SomeLeft--;
    RemainingInput[inputIndex]--;
}



int FlagProduct(TheProducts, index1, index2, this)
int *TheProducts;
int index1, index2, this;
{
    int what;
    unsigned char *Products;

    /* if (DebugLevel)
        printf("%d %d, %d, %x\n",index1,index2,this,TheProducts);*/

    Products = (unsigned char *) TheProducts;

    if (!this ) this = index1*Y_02_Length[CurrentInput] + index2;
    what = this % 8;    /* bit index */
    this /= 8;
    Products[this] |= setbits[what];
    return 1;
}

int FlagReagent(TheReagent, size, index)
int *TheReagent;
int size, index;
{
    int what, this;
    unsigned char *Reagent;

    Reagent = (unsigned char *) TheReagent;
    what = index % 8;
    this = index / 8;
    Reagent[this] |= setbits[what];
    return 1;
}

int SelectFromInputFile(inputSource, query)
char *inputSource;
int *query;
{
    static int firstTime = 1;
    static FILE *fp = (FILE *)NULL;
    unsigned char *p, *q;
    int index1;
    int index2;
    int index;
    char *line;
    char *cp;
    unsigned char *queryPtr;
    int which;
    int i;
    char *name = (char *)NULL;
    char hold[81];
    int OldIndex;
    int Y_01_Offset, Y_02_Offset;

    OldIndex = CurrentInput;

    if ( firstTime )
    {
        if ( !(fp = fopen(inputSource, "r")))
            goto UnableToOpenFile;
        firstTime = 0;
    }
    if (-1 == UTL_SCAN_GETS( fp, "", &line)) return 0;

    if ( !( cp = strtok(line, " ")) )
        goto UnableToParseLine;
    /*
    ** Hold on to this for now.
    */
    name = UTL_STR_SAVE(cp);
    if ( !( cp = strtok(NULL, " ")) )
        goto UnableToParseLine;
    index1 = atoi(cp) - 1;
    if ( !( cp = strtok(NULL, " ")) )
        goto UnableToParseLine;
    index2 = atoi(cp) - 1;

    if (( index1 < 0 ) || ( index2 < 0 ) )
        goto UnableToParseLine;
    /*
    ** Must get the input from a file that we are processing now to work.
    */
    for ( CurrentInput = -1, i = 0; i < TotalInputs; i++ )
    {
#if 0
        which = index1 * Y_02_Length[i] + index1;
        sprintf(hold, "%s%d", InputNames[i], which+1);
        if ( UTL_STR_CMP_NOCASE(name, hold) == 0 )
        {
            CurrentInput = i;
            break;
        }
#endif
        if ( UTL_STR_NCMP_NOCASE(name, InputNames[i], strlen(InputNames[i])) == 0 )
        {
            CurrentInput = i;
            break;
        }
    }
    if ( i >= TotalInputs )
        goto InvalidInput;
    /*
    ** If we are reading back in a selection that might have already been filtered
    ** out we better adjust our counts.
    */
    if (!TestBit(Dead_Products,
                 BitMapStartPoint[CurrentInput] +
                 index1 * Y_02_Length[CurrentInput] +
                 index2 ))
    {
        nProcessed++;
        SomeLeft--;
        RemainingInput[CurrentInput]--;
    }

    Y_01_Offset = Y_02_Offset = 0;
    for ( i = 0; i < CurrentInput; i++ )
    {
        Y_01_Offset += Y_01_Length[i];
        Y_02_Offset += Y_02_Length[i];
    }

    c_query = (*CalcFingerPrintFunc)(query,
                                     Y_01[index1 + Y_01_Offset],
                                     Y_02[index2 + Y_02_Offset]);
#if 0
    p = (unsigned char *) Y_01[index1];
    q = (unsigned char *) Y_02[index2];
    c_query = 0;
    queryPtr = (unsigned char *)query;
    for (index = 0; index < BytesPerFingerPrint; index++, queryPtr++)
    {
        *queryPtr = *p++ | *q++;
        c_query += nbits[*queryPtr & 255];
    }
#endif

    OutputThisHit(index1, index2); /* both print it and note it in bitsets */
    CurrentInput = OldIndex;       /* reset this back to the beginning */
    if ( name )
        UTL_MEM_FREE(name);
    return c_query;

UnableToParseLine:
    fprintf(stderr, "Unable to Parse %s\n", line);
    return 0;
UnableToOpenFile:
    fprintf(stderr, "Unable to open file %s\n", inputSource);
    return 0;
InvalidInput:
    fprintf(stderr, "Input %s does not match one of the -prefix files\n", name);
    return 0;
}



CountFingerPrintBits(fingerPrint, length)
int *fingerPrint;
int length;
{
    int i;
    int count = 0;

    for ( i = 0; i < length; i++ )
    {
        count += nbits[fingerPrint[i] & 255];
    }
    return count;
}
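
CountFingerPrintBits sums entries of the nbits[] table, which is declared in Appendix "N" but whose initialization is not shown here. A minimal sketch of how such a 256-entry table could be filled and used follows. It assumes nbits[v] holds the number of 1 bits in byte value v, and it walks the fingerprint byte by byte, whereas the listing indexes through an int pointer and masks with 255; that difference is an assumption about intent, not a statement of what the original does. The names ending in _sketch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed table: entry v is the number of 1 bits in the byte value v. */
static int nbits_sketch[256];

/* Fill the table once; each entry reuses the entry for v shifted right. */
void init_nbits_sketch(void)
{
    int v;

    nbits_sketch[0] = 0;
    for (v = 1; v < 256; v++)
        nbits_sketch[v] = nbits_sketch[v >> 1] + (v & 1);
}

/* Count the set bits (the cardinality) over `length` bytes. */
int count_bits_sketch(const unsigned char *bytes, size_t length)
{
    size_t i;
    int count = 0;

    assert(bytes != NULL || length == 0);
    for (i = 0; i < length; i++)
        count += nbits_sketch[bytes[i]];
    return count;
}
```

A table lookup per byte avoids looping over individual bits, which matters when every candidate product needs its cardinality computed.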

/* Here the intent is to select the next compound "intelligently".
   We try to maximize use of one or the other reagent.
*/
int SelectIt(query, whatFirst)
int *query;
char *whatFirst;
{
    int i, j;

    if (What1 < 0) { GrabRandom( &i, &j, query); goto out; }
    switch (whatFirst[0])
    {
    case '0':
        GrabRandom( &i, &j, query);
        break;
    case '1':
        GrabThis( &i, &j, 1, query);
        break;
    case '2':
        GrabThis( &i, &j, 2, query);
        break;
    }
out:
    OutputThisHit(i, j); /* both print it and note it in bitsets */
    return c_query;
}

int GrabThis( p1, p2, type, fp)
int *p1, *p2, type, *fp;
{
    unsigned char *p, *q, *pro;
    int index;
    int Y_01_Offset, Y_02_Offset;
    int i;
    int NY02;
    int NY01;

    while ( CurrentInput <= TotalInputs )
    {
        /*
        ** Process each one of the inputs walking down the data array.
        */
        NY02 = Y_02_Length[CurrentInput];
        NY01 = Y_01_Length[CurrentInput];
        switch (type)
        {
        case 1:
            if (!findOne(Dead_Products, What1*NY02, 1, NY02) &&
                !findOne(Dead_Products, What2, NY02, NY01) &&
                !GrabRandom( p1, p2, fp) )
            {
                CurrentInput++;
                continue;
            }
            break;
        case 2:
            if (!findOne(Dead_Products, What2, NY02, NY01) &&
                !findOne(Dead_Products, What1*NY02, 1, NY02) &&
                !GrabRandom( p1, p2, fp) )
            {
                CurrentInput++;
                continue;
            }
            break;
        }
        /*
        ** If we are at the end of this input set, we need to advance to the
        ** next one.
        */
        if ( ( What1 >= Y_01_Length[CurrentInput] ) ||
             ( What2 >= Y_02_Length[CurrentInput] ) )
        {
            What1 = 0;
            What2 = 0;
            CurrentInput++;
            continue;
        }
        break;
    }

    if ( CurrentInput == TotalInputs )
        return 0;

    *p1 = What1; *p2 = What2;
    Y_01_Offset = Y_02_Offset = 0;
    for ( i = 0; i < CurrentInput; i++ )
    {
        Y_01_Offset += Y_01_Length[i];
        Y_02_Offset += Y_02_Length[i];
    }

    c_query = (*CalcFingerPrintFunc)(fp,
                                     Y_01[What1 + Y_01_Offset],
                                     Y_02[What2 + Y_02_Offset]);
#if 0
    pro = (unsigned char *) fp;
    p = (unsigned char *) ( Y_01[What1 + Y_01_Offset] );
    q = (unsigned char *) ( Y_02[What2 + Y_02_Offset] );
    c_query = 0;
    for (index = 0; index < BytesPerFingerPrint; index++, pro++)
    {   *pro = *p++ | *q++;
        c_query += nbits[*pro & 255]; }
#endif
    return 1;
}

/* This can be done more efficiently when we KNOW we are walking a vector */
int findOne(bitset, start, incr, max)
int *bitset, start, incr, max;
{
    static int oldstart = -1234,
               oldincr,
               old_i;
    int i;

    if ( (start != oldstart) || (incr != oldincr) ) old_i = -1;
    oldstart = start; oldincr = incr;
    old_i++;
    start += incr * old_i;
    for (i = old_i; i < max; i++, start += incr )
    {
        if ( TestBit(bitset, BitMapStartPoint[CurrentInput] + start)) continue;
        What1 = start / Y_02_Length[CurrentInput];
        What2 = start % Y_02_Length[CurrentInput];
        old_i = i;
        return 1;
    }
    oldstart = -1234;
    return 0;
}



/*
**+E:
**
** Abstract : Function will randomly select a product from the current
**            input file. If there are no more selections left in the
**            current input then the next one is searched.
**
**            Currently we deal with two reaction sites (two points of
**            variability, Y1 and Y2). A product is one possible
**            combination of Y1.Y2.
**
**            The bit maps that are used to track the selected/eliminated
**            products are a vector of length numY01*numY02, where every
**            bit represents a product. Since we are dealing with multiple
**            slns we just use one bitmap vector and string the
**            representations for each set of products together:
**
**            [reaction1(csln1) products] [reaction2(csln2) products] ...
**
**            where each reaction's products are laid out in row major
**            format :
**
**            Y1(0).Y2(0), Y1(0).Y2(1) .. | Y1(1)Y2(0),Y1(1)Y2(1)...
**
** Usage :
**
** Returns : 1 on success, 0 for failure.
**
** Algorithms : None.
**
** Revision History :
**            Modified to work with multiple csln
**            processing and documented.
**
*/

int GrabRandom( p1, p2, fp)
int *p1, *p2, *fp;
{
    int index, sum;
    int i;
    int value1, value2;
    unsigned char *p, *q, *pro;
    int byteOffset, bitsToSkip;
    int Y_01_Offset, Y_02_Offset;
    float f;
    /*
    ** Lets start at the beginning portion of the bitmap for this input, our products
    ** are laid out like :
    **
    ** global variable CurrentInput tells us which reaction we are looking at now
    ** and BitMapStartPoint[] has the starting points in the bitmap for each
    ** input.
    **
    ** We know how many products are left in the current input, if we get to
    ** zero here we need to move on to the next input set.
    */
    for ( i = CurrentInput; i < TotalInputs; i++ )
        if ( RemainingInput[CurrentInput] > 0 )
            break;
        else
            CurrentInput++;
    if ( CurrentInput >= TotalInputs )
        return 0;
    /*
    ** Figure out which byte in the bitmap the products for this input start.
    **/
    byteOffset = BitMapStartPoint[CurrentInput] / 8;
    p = (unsigned char *) ( Dead_Products );
    p += byteOffset;

    index = (( f = UTL_MATH_RAND()) * RemainingInput[CurrentInput]);
    value1 = sum = 0;
    while (sum < index)
    {
        sum += nbits[ ~(*p++) & 255];
        value1 += 8;
    }
    p -= 1; sum -= nbits[ ~(*p) & 255 ]; value1 -= 9; value2 = (~(*p) & 255);
    while (sum < index)
    {
        value1++;
        if ( value2 & 1)
            sum++;
        value2 = value2 >> 1;
    }
    /*
    ** We found where our random (not selected) product is, now we need to go
    ** back so many bits to be able to translate the address in a one dimensional
    ** bitmap vector into a 2D index.
    ** (This is because our bitmap representation for this input did not start
    ** from 0 (or a byte boundary).
    */
    bitsToSkip = BitMapStartPoint[CurrentInput] - ( byteOffset * 8 );

    value1 -= bitsToSkip;
    What2 = ( value1 ) % Y_02_Length[CurrentInput];
    What1 = ( value1 ) / Y_02_Length[CurrentInput];

    *p1 = What1;
    *p2 = What2;
    /*
    ** Find out where the values for this product are.
    */
    Y_01_Offset = Y_02_Offset = 0;
    for ( i = 0; i < CurrentInput; i++ )
    {
        Y_01_Offset += Y_01_Length[i];
        Y_02_Offset += Y_02_Length[i];
    }
    c_query = (*CalcFingerPrintFunc)(fp,
                                     Y_01[What1 + Y_01_Offset],
                                     Y_02[What2 + Y_02_Offset]);
#if 0
    pro = (unsigned char *) fp;
    p = (unsigned char *) ( Y_01[What1 + Y_01_Offset] );
    q = (unsigned char *) ( Y_02[What2 + Y_02_Offset] );

    /*
    ** Calculate the approximate fingerprint by ORing the fingerprints
    ** for the two pieces.
    */
    c_query = 0;
    for (index = 0; index < BytesPerFingerPrint; index++, pro++)
    {   *pro = *p++ | *q++;
        c_query += nbits[*pro & 255]; }
#endif
    return 1;
}
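
The disabled block inside GrabRandom describes approximating a product fingerprint by ORing the fingerprints of its two reagents and counting the resulting bits. The sketch below restates that step in isolation, assuming equal-length byte fingerprints. The function names are hypothetical, and the per-byte count uses a simple loop in place of the nbits[] table so that the example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Count the 1 bits in one byte by clearing the lowest set bit each pass. */
static int byte_popcount_sketch(unsigned char v)
{
    int n = 0;

    while (v) {
        v &= (unsigned char)(v - 1);
        n++;
    }
    return n;
}

/* OR two reagent fingerprints into `product`; return its cardinality. */
int or_fingerprints_sketch(unsigned char *product,
                           const unsigned char *y1,
                           const unsigned char *y2,
                           size_t nbytes)
{
    size_t i;
    int cardinality = 0;

    assert(product != NULL && y1 != NULL && y2 != NULL);
    for (i = 0; i < nbytes; i++) {
        product[i] = (unsigned char)(y1[i] | y2[i]);
        cardinality += byte_popcount_sketch(product[i]);
    }
    return cardinality;
}
```

The OR is only an approximation of the true product fingerprint, as the original comment says: features that arise from joining the two pieces are not represented.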



int OutputThisHit( index1, index2)
int index1, index2;
{
    int which;

    which = index1 * Y_02_Length[CurrentInput] + index2;
    fprintf(OutputFile, "%s%d %d %d\n", InputNames[CurrentInput], which+1,
            index1+1, index2+1);
    which = BitMapStartPoint[CurrentInput] + index1*Y_02_Length[CurrentInput]
            + index2;
    FlagProduct(Good_Products, 0, 0, which);
    FlagProduct(Dead_Products, 0, 0, which); /* can only be selected once */
    /* note use of reagents; this is slightly wasteful of time */
    FlagReagent(Good_1, nY_01, index1);
    FlagReagent(Good_2, nY_02, index2);

    fflush(OutputFile);
    return 1;
}
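
OutputThisHit, findOne and GrabRandom all rely on the same mapping between a reagent pair (index1, index2) and a bit position in the shared product bitmap: a per-input start offset plus a row-major index. A minimal sketch of both directions of that mapping follows. The struct and function names are hypothetical; the fields stand in for the BitMapStartPoint[] and Y_02_Length[] entries of one input.

```c
#include <assert.h>

/* Hypothetical description of one input's slice of the shared bitmap. */
struct input_layout_sketch {
    int start;     /* first bit of this input, cf. BitMapStartPoint[] */
    int y2_length; /* number of Y_02 reagents, cf. Y_02_Length[] */
};

/* Reagent pair to bit position: row-major within the input's slice. */
int product_bit_sketch(const struct input_layout_sketch *in,
                       int index1, int index2)
{
    assert(in->y2_length > 0);
    assert(index1 >= 0 && index2 >= 0 && index2 < in->y2_length);
    return in->start + index1 * in->y2_length + index2;
}

/* Bit position back to the reagent pair. */
void product_indexes_sketch(const struct input_layout_sketch *in,
                            int bit, int *index1, int *index2)
{
    int local = bit - in->start; /* position relative to this input */

    assert(in->y2_length > 0 && local >= 0);
    *index1 = local / in->y2_length;
    *index2 = local % in->y2_length;
}
```

For an input starting at bit 13 with five Y_02 reagents, the pair (2, 3) lands on bit 13 + 2*5 + 3 = 26, and decoding bit 26 returns the same pair.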

int DumpBitSet(bitSet, offset, numY01, numY02)
int *bitSet;
int offset;
int numY01;
int numY02;
{
    int i, j;
    unsigned char *Products = (unsigned char *)bitSet;
    int pos;
    int byte;
    int bit;
    int index1;
    int index2;

    fprintf(stderr, "\n Y_02 ");
    for ( i = 0; i < numY02; i++ )
        fprintf(stderr, " %3d ", i+1);
    fprintf(stderr, "\n- ");
    for ( i = 0; i < (numY01 * numY02); i++ )
    {
        index1 = i / numY02;
        index2 = i % numY02;
        byte = ( i + offset ) / 8;
        bit = ( i + offset ) % 8;
        if (( ( i % numY02 ) == 0 ) )
            fprintf(stderr, "\n%3d |", index1+1);
        fprintf(stderr, " %3d ", (Products[byte] & setbits[bit]) ? 1 : 0 );
    }
    fprintf(stderr, "\n");
}

int DumpValues(inputSet, numY01, numY02, computeFunc)
int inputSet;
int numY01;
int numY02;
int (*computeFunc)();
{
    int i, j;
    int pos;
    int byte;
    int bit;
    int index1;
    int index2;
    int onion;
    int intsc;
    double max;
    int Y2_Offset;
    int Y1_Offset;

    fprintf(stderr, "\n Y_02 \n");
    for ( i = 0; i < numY02; i++ )
        fprintf(stderr, " %5d ", i+1);
    fprintf(stderr, "\n");
    for ( Y1_Offset = Y2_Offset = i = 0; i < inputSet; i++ )
    {
        Y1_Offset += Y_01_Length[i];
        Y2_Offset += Y_02_Length[i];
    }
    for ( i = 0; i < (numY01 * numY02); i++ )
    {
        index1 = Y1_Offset + i / numY02;
        index2 = Y2_Offset + i % numY02;
        (*computeFunc)( index1, index2, &onion, &intsc, &max);
        if (( ( index2 - Y2_Offset ) == 0 ) )
            fprintf(stderr, "\n%5d | ", index1 + 1 - Y1_Offset );
        fprintf(stderr, " %0.3f ", max);
    }
    fprintf(stderr, "\n");
    DumpBitSet(Dead_Products, BitMapStartPoint[inputSet], numY01, numY02);
}
Appendix "P"
ELIMATE

#include <stdio.h>
#include <signal.h>
#include <ctype.h>
#include <unistd.h>
#include <string.h>
#include <sys/stat.h>
#include <math.h>
#include "parseopt.h"
#include "utl_str.h"
#include "utl_mem.h"
#include "utl_file.h"
#include "utl_math.h"
#include "ct.h"
#include "ct_expr.h"
#include "ct_proto.h"
#include "import_proto.h"
#include "io_fprint.h"
#include "dservTypes.h"
/*
**+E:
**
** Abstract : Function zaps products that are in the same neighborhood
**            as the SLNs in the Unity hitlist files.
**
** Usage :
**
** Returns : 1 on success, 0 on error
**
** Algorithms : None.
**
** Revision History :
*/

int EliminateProductsFromDatabase(DatabaseNames, numEliminated, zapNeighbors)
char *DatabaseNames;
int *numEliminated;
int (*zapNeighbors)();
{
    int i;
    char *cp;
    struct IoDataBase *database = 0;
    struct IoFingerPrint *fingerPrint = 0;
    struct IoFingerPrintInfo fprintInfo;
    struct IoFprintDef *fprintDef = 0;

    int fingerPrintFile;
    int bytesPerFingerPrint;
    long slnId;
    char *databaseDirectory;
    char *databaseName;
    int numZapped = 0;
    int c_query;

    *numEliminated = 0;
    if ( DatabaseNames )
    {
        cp = strtok(DatabaseNames, " ");
        while ( cp )
        {
            /*
            ** Open the database.
            */
            databaseDirectory = dirname( cp );
            databaseDirectory = databaseDirectory?databaseDirectory:
            databaseName = basename(cp,(char*)0);

            if ( !( database = DB_IO_DBSTART_USER_PSWD(
                                   databaseDirectory,
                                   databaseName,
                                   "r",
                                   NULL,
                                   NULL,
                                   0,
                                   0)) )
                goto UnableToOpenDatabase;

            if( !(fprintDef = DB_IO_FPDEF_VREAD( database, "standard" )))
                goto NoSuchScreen;
            if ( !DB_IO_FPRINT_GETINFO( database,
                                        fprintDef->fpdFprintDir,
                                        fprintDef->fpdFprintFileName,
                                        &fprintInfo ) )
                goto UnableToGetScreenInfo;
            if( !(fingerPrintFile = DB_IO_FPRINT_OPEN( database,
                                        fprintDef->fpdFprintDir,
                                        fprintDef->fpdFprintFileName )))
                goto NoSuchScreen;
            bytesPerFingerPrint = fprintInfo.fpFingerLength / 8;
            fprintf(stderr, "Processing database %s %d\n", cp, bytesPerFingerPrint);
            if ( (fprintInfo.fpFingerLength % 8 ) )
                bytesPerFingerPrint++;
            fingerPrint = (struct IoFingerPrint *) UTL_MEM_ALLOC(
                              sizeof(*fingerPrint) + bytesPerFingerPrint + 1 );
            /*
            ** Read all compounds in the database.
            */
            for ( slnId = fprintInfo.fpStartSlnNo;
                  slnId <= fprintInfo.fpLastSlnNo;
                  slnId++ )
            {
                if ( !DB_IO_FPRINT_READ( database,
                                         fingerPrintFile,
                                         (long) slnId,
                                         fingerPrint ) )
                    goto UnableToReadFromDatabase;
                /*
                ** Zap all the neighbors in the current run.
                */
                c_query = CountFingerPrintBits(&fingerPrint->fpPrint,
                                               bytesPerFingerPrint);
                fprintf(stderr, "Reading sln %d\r", slnId++);
                (*zapNeighbors)(&fingerPrint->fpPrint, c_query, &numZapped, 0, -1, -1);
                (*numEliminated) += numZapped;
            }
            /*
            ** CLOSING DATABASES TRASHES MEMORY, DO THIS AFTER YOU CAN
            ** SPEND SOME TIME TO DEBUG THIS PROBLEM.
            ** F.S. 05-14-96
            */
#if 0
            /*
            ** Close the database and do it again!
            */
            DB_IO_DBCLOSE(database);
            /*
            ** Get the next database to process.
            */
            if ( fingerPrint )
                UTL_MEM_FREE( fingerPrint );
            DB_IO_FPRINT_CLOSE( database, fingerPrintFile );
#endif
            cp = strtok(NULL, " ");
        }
    }
    fprintf(stderr, "\n");
    return 1;

UnableToOpenDatabase:
    fprintf(stderr, "Unable to open database %s\n", cp);
    goto Error;
NoSuchScreen:
    fprintf(stderr, "Unable to open screen 'standard'\n");
    goto Error;
UnableToGetScreenInfo:
    fprintf(stderr, "Unable to read screen information\n");
    goto Error;
UnableToReadFromDatabase:
    fprintf(stderr, "Unable to read fingerprint for sln id %d\n", slnId);
    goto Error;
Error:
    return 0;
}

/*
**+E:
**
** Abstract : Function zaps products that are in the same neighborhood
**            as the SLNs in the Unity hitlist files.
**
** Usage :
**
** Returns : 1 on success, 0 on error
**
** Algorithms : None.
**
**-E:
*/
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int EliminateProductsFromHitlist(HitlistNames,screenName,numEliminated,zapNeighbors) 

char *HitlistNames ; 
char *screenName ; 
int *numEliminated ; 
5 int (*zapNeighbors)0; 
{ 

    int i;
    char *cp;
    struct IoFingerPrint *fingerPrint = 0;
    FILE *fingerPrintFile = 0;
    int bytesPerFingerPrint;
    long slnId;
    char *databaseDirectory;
    char *databaseName;
    int numZapped = 0;
    struct CtConnectionTable *ct;
    int nBitsSet;
    char *sln;
    FILE *handle;
    int c_query;
    int *ScreenStructure;

    *numEliminated = 0;
    if ( !HitlistNames )
        return 1;

    /*
    ** Read in the screen information first.
    */
    if (!(fingerPrintFile = UTL_FILE_FOPEN(screenName, "r")))
        goto UnableToOpenFingureprintFile;
    ScreenStructure = (int *) DB_BIT2_PARSE_2DSCREEN(fingerPrintFile);
    UTL_FILE_FCLOSE(fingerPrintFile);
    if (!ScreenStructure)
        goto UnableToReadScreenStructure;
    bytesPerFingerPrint = DB_BIT2_GET_SIZE( ScreenStructure );
    fingerPrint = (struct IoFingerPrint *) UTL_MEM_ALLOC( bytesPerFingerPrint );
    *numEliminated = 0;
    if ( HitlistNames )
    {
        cp = strtok(HitlistNames, " ");
        while ( cp )
        {
            slnId = 0;
            fprintf(stderr, "Processing hitlist %s\n", cp);
            /*
            ** Open the database.
            */
            if ( !(handle = fopen(cp, "r")) )
                goto UanbleToOpenHitlist;
            /*
            ** Read all the hits in the hitlist.
            */
            while ( UTL_SCAN_GETS ( (FILE *) handle, "W", &sln) !=

                    )

            {
                if (!(ct = DB_IMPORT_SLN(sln)))
                    goto UnableToGetCtFromSln;
                memset ( fingerPrint, 0, bytesPerFingerPrint );
                if ( !DB_BIT2_EVALUATE( ct,
                                        ScreenStructure,
                                        fingerPrint,
                                        &nBitsSet ))
                    goto UnableToGenerateFingerprint;
                /*
                ** Zap all the neighbors in the current run.
                */
                c_query = 0;
                c_query = CountFingerPrintBits(&fingerPrint->fpPrint,
                                               bytesPerFingerPrint);
                fprintf(stderr, "Reading sln %d\r", slnId++);
                (*zapNeighbors)(&fingerPrint->fpPrint, c_query, &numZapped, 0, -1, -1);
                (*numEliminated) += numZapped;
            }

            /*
            ** Close the database and do it again!
            */
            fclose(handle);
            /*
            ** Get the next hitlist to process.
            */
            cp = strtok(NULL, " ");
        }
    }
    fprintf(stderr, "\n");
    if ( fingerPrint )
        UTL_MEM_FREE( fingerPrint );

    return 1;

UnableToOpenFingureprintFile:
    fprintf(stderr, "Unable to open fingerprint file %s\n", screenName);
    goto Error;
UnableToReadScreenStructure:
    fprintf(stderr, "Unable to read screen info for %s\n", screenName);
    goto Error;
UanbleToOpenHitlist:
    fprintf(stderr, "Unable to open hitlist %s\n", cp);
    goto Error;
UnableToGenerateFingerprint:
    fprintf(stderr, "Unable to generate fingerprint for \n%s\n", sln);
    goto Error;
UnableToGetCtFromSln:
    fprintf(stderr, "Unable to generate ct for \n%s\n", sln);
    goto Error;

Error:
    return 0;
}



